In [1]:
# Getting current date and time using now().
 
# importing datetime module for now()
import datetime
 
# using now() to get current time
current_time = datetime.datetime.now()
 
# Printing value of now.
print("Time now is:", current_time)
Time now is: 2023-03-20 19:37:03.370259

1. Introduction

The problem statement revolves around detecting fraud in auto insurance claims, a critical issue for many general insurance companies. Fraudulent claims not only cause significant leakage for the insurer but also affect innocent people's lives. Insurance fraud can be classified by its source, including policyholders, intermediaries, and internal actors, and by its nature, such as application fraud, inflated claims, identity fraud, fabrication, and staged accidents. Detecting and preventing fraud in auto insurance claims requires a robust analytical and modelling framework that can predict the likelihood of fraud before a claim is processed, and that can uncover the hidden patterns in the data that characterise fraudulent claims.

We are expected to perform exploratory data analysis, report the results from learning curves, build an analytical framework to predict fraud, and extract the top 20 patterns for fraudulent claims using decision tree algorithms only. Such a framework can benefit not only insurance companies but also regulatory bodies and law enforcement agencies.

Objectives:

  • To perform exploratory data analysis using visualizations
  • To report the results/observations from learning curves
  • To build an analytical framework to predict whether a claim is fraudulent or not
  • To find missing data
  • To extract the top 20 patterns for fraudulent claims using the Decision Tree algorithm

2. Importing the Required Libraries

We import various packages available in Python to make our tasks easier.

In [2]:
import os # Provides functions for creating and removing a directory (folder), fetching its contents, changing and identifying the current directory, etc. 
import pandas as pd # To perform Operations on DataFrames.
import numpy as np # Perform a number of mathematical operations on arrays such as statistical, and algebraic.
import matplotlib.pyplot as plt # To create plots and visualizations.

import missingno as msno # To visualize the missing-value structure of a dataframe.
In [3]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

3. Loading the Data into Pandas Dataframes

We are provided with multiple CSV files, which are to be merged and used for Model Building.

Attributes Information Provided:

Demographics Data:

  • CustomerID : Customer ID
  • InsuredAge : age
  • InsuredZipCode : Zip Code
  • InsuredGender : Gender - the missing value is represented as “NA”
  • InsuredEducationLevel : Education
  • InsuredOccupation : Occupation
  • InsuredHobbies : Hobbies
  • CapitalGains : Capital gains (financial status)
  • CapitalLoss : Capital loss (financial status)
  • Country : Country

Policy Information:

  • CustomerID : Customer ID
  • CustomerLoyaltyPeriod : Duration of customer relationship
  • InsurancePolicyNumber : policy number
  • DateOfPolicyCoverage : policy commencement date
  • InsurancePolicyState : Policy location (State)
  • Policy_CombinedSingleLimit : Split Limit and Combined Single Limit
  • Policy_Deductible : Deductible amount
  • PolicyAnnualPremium : Annual Premium – the missing value is represented as “-1”
  • UmbrellaLimit : Umbrella Limit amount
  • InsuredRelationship : Relationship

Claim Information:

  • CustomerID : Customer ID
  • DateOfIncident : Date of incident
  • TypeOfIncident : Type of incident
  • TypeOfCollission : Type of Collision - “?” is the missing value
  • SeverityOfIncident : Collision severity
  • AuthoritiesContacted : Which authorities are contacted
  • IncidentState : Incident location (State)
  • IncidentCity : Incident location (City)
  • IncidentAddress : Incident location (address)
  • IncidentTime : Time of incident (hour of the day) - the missing value is represented as “-5”
  • NumberOfVehicles : Number of vehicles involved
  • PropertyDamage : If property damage is there - “?” is the missing value
  • BodilyInjuries : Number of bodily injuries
  • Witnesses : Number of witnesses - missing value is represented as "MISSINGVALUE"
  • PoliceReport : If police report available - “?” is the missing value
  • AmountOfTotalClaim : Total claim amount - the missing value is represented as “MISSEDDATA”
  • AmountOfInjuryClaim : Claim for injury
  • AmountOfPropertyClaim : claim for property damage
  • AmountOfVehicleDamage : claim for vehicle damage

Data of Vehicle:

  • CustomerID : Customer ID
  • VehicleAttribute : Service signed for
  • VehicleAttributeDetails : Value of the vehicle attribute - the missing value is represented as “???”

Fraud Data :

  • CustomerID : Customer ID
  • ReportedFraud : Fraud or not – Target

3.1 Loading Train Data

We are provided with 5 different files for Training Data.

Files are:

  • Train_Demographics.csv
  • Train_Policy.csv
  • Train_Claim.csv
  • Train_Vehicle.csv
  • Traindata_with_Target.csv

Note:

  • The data comes with special missing-value notations, as mentioned in the attribute descriptions above.
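As a toy sketch (with invented rows, not taken from the provided files), this is how the `na_values` argument of `pd.read_csv` maps those sentinel strings to `NaN` at load time:

```python
from io import StringIO
import pandas as pd

# Two invented rows carrying the dataset's sentinel values "-1" and "MISSINGVALUE".
csv_text = (
    "CustomerID,PolicyAnnualPremium,Witnesses\n"
    "Cust1,-1,2\n"
    "Cust2,1350.5,MISSINGVALUE\n"
)

# na_values matches the raw string tokens before type conversion,
# so both sentinels become NaN and the columns stay numeric.
df = pd.read_csv(StringIO(csv_text), na_values=['-1', 'MISSINGVALUE'])
print(df.isna().sum())  # one NaN in PolicyAnnualPremium, one in Witnesses
```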
In [4]:
train_demographic = pd.read_csv("../Train Data/Train_Demographics.csv",na_values=['NA'])
train_policy = pd.read_csv("../Train Data/Train_Policy.csv",na_values=['NA', '-1', 'MISSINGVAL'])
train_claim = pd.read_csv("../Train Data/Train_Claim.csv",na_values=['?', '-5', 'MISSINGVALUE', 'MISSEDDATA'])
train_vehicle = pd.read_csv("../Train Data/Train_Vehicle.csv" ,na_values=['???'])
train_target = pd.read_csv("../Train Data/Traindata_with_Target.csv")
In [5]:
# print the shapes of the dataframes
print('Shape of train_demographic:', train_demographic.shape)
print('Shape of train_policy:', train_policy.shape)
print('Shape of train_claim:', train_claim.shape)
print('Shape of train_vehicle:', train_vehicle.shape)
print('Shape of train_target:', train_target.shape)
Shape of train_demographic: (28836, 10)
Shape of train_policy: (28836, 10)
Shape of train_claim: (28836, 19)
Shape of train_vehicle: (115344, 3)
Shape of train_target: (28836, 2)
In [6]:
# print the columns of the dataframes
print('columns of train_demographic:', train_demographic.columns)
print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
print('columns of train_policy:', train_policy.columns)
print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
print('columns of train_claim:', train_claim.columns)
print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
print('columns of train_vehicle:', train_vehicle.columns)
print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
print('columns of train_target:', train_target.columns)
columns of train_demographic: Index(['CustomerID', 'InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'Country'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
columns of train_policy: Index(['InsurancePolicyNumber', 'CustomerLoyaltyPeriod', 'DateOfPolicyCoverage', 'InsurancePolicyState', 'Policy_CombinedSingleLimit', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'CustomerID'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
columns of train_claim: Index(['CustomerID', 'DateOfIncident', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
columns of train_vehicle: Index(['CustomerID', 'VehicleAttribute', 'VehicleAttributeDetails'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
columns of train_target: Index(['CustomerID', 'ReportedFraud'], dtype='object')

3.1.1 Reshaping the "train_vehicle" DataFrame

In this dataframe, every "CustomerID" appears on four rows, one per attribute. We need to pivot the unique values of the "VehicleAttribute" column into individual columns, filled with the corresponding values from the "VehicleAttributeDetails" column.
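The long-to-wide reshape used below can be sketched on a tiny invented frame that mirrors the layout of "train_vehicle":

```python
import pandas as pd

# Toy long-format frame: four attribute rows per customer (values invented).
long_df = pd.DataFrame({
    'CustomerID': ['C1'] * 4 + ['C2'] * 4,
    'VehicleAttribute': ['VehicleID', 'VehicleMake', 'VehicleModel', 'VehicleYOM'] * 2,
    'VehicleAttributeDetails': ['V1', 'Audi', 'A5', '2008',
                                'V2', 'Toyota', 'CRV', '2010'],
})

# Pivot each attribute into its own column; aggfunc='first' keeps the single
# value per (customer, attribute) pair, then tidy the axes back to a flat frame.
wide_df = (long_df
           .pivot_table(index='CustomerID', columns='VehicleAttribute',
                        values='VehicleAttributeDetails', aggfunc='first')
           .rename_axis(None, axis=1)
           .reset_index())
print(wide_df)  # one row per customer, one column per vehicle attribute
```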

In [7]:
train_vehicle.head()
Out[7]:
CustomerID VehicleAttribute VehicleAttributeDetails
0 Cust20179 VehicleID Vehicle8898
1 Cust21384 VehicleModel Malibu
2 Cust33335 VehicleMake Toyota
3 Cust27118 VehicleModel Neon
4 Cust13038 VehicleID Vehicle30212
In [8]:
# pivot the dataframe to get unique values of VehicleAttribute as columns and VehicleAttributeDetails as the values
train_vehicle = train_vehicle.pivot_table(index='CustomerID', columns='VehicleAttribute', values='VehicleAttributeDetails', aggfunc='first')
train_vehicle.head()
Out[8]:
VehicleAttribute VehicleID VehicleMake VehicleModel VehicleYOM
CustomerID
Cust10000 Vehicle26917 Audi A5 2008
Cust10001 Vehicle15893 Audi A5 2006
Cust10002 Vehicle5152 Volkswagen Jetta 1999
Cust10003 Vehicle37363 Volkswagen Jetta 2003
Cust10004 Vehicle28633 Toyota CRV 2010
In [9]:
# remove the 'VehicleAttribute' label from the columns axis
train_vehicle = train_vehicle.rename_axis(None, axis=1)
# reset the index to convert the pivot table to a regular dataframe
train_vehicle = train_vehicle.reset_index()
# print the resulting head of the dataframe and its shape
display(train_vehicle.head(10))
print("\n")
print('Shape of train_vehicle:', train_vehicle.shape)
CustomerID VehicleID VehicleMake VehicleModel VehicleYOM
0 Cust10000 Vehicle26917 Audi A5 2008
1 Cust10001 Vehicle15893 Audi A5 2006
2 Cust10002 Vehicle5152 Volkswagen Jetta 1999
3 Cust10003 Vehicle37363 Volkswagen Jetta 2003
4 Cust10004 Vehicle28633 Toyota CRV 2010
5 Cust10005 Vehicle26409 Toyota CRV 2011
6 Cust10006 Vehicle12114 Mercedes C300 2000
7 Cust10007 Vehicle26987 Suburu C300 2010
8 Cust10009 Vehicle12490 Volkswagen Passat 1995
9 Cust1001 Vehicle28516 Saab 92x 2004
Shape of train_vehicle: (28836, 5)

3.1.2 Merging the train DataFrames

We join all the datasets into a single DataFrame named "train_df" using the merge function from the pandas library.
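The chained inner merges on 'CustomerID' work like this minimal sketch (rows invented; in the actual cell below, five frames are merged the same way):

```python
import pandas as pd

# Three small frames sharing the 'CustomerID' key.
demo   = pd.DataFrame({'CustomerID': ['C1', 'C2'], 'InsuredAge': [35, 36]})
policy = pd.DataFrame({'CustomerID': ['C1', 'C2'], 'Policy_Deductible': [1000, 617]})
target = pd.DataFrame({'CustomerID': ['C1', 'C2'], 'ReportedFraud': ['N', 'Y']})

# Each inner merge keeps one row per customer and appends the new columns.
merged = demo.merge(policy, on='CustomerID').merge(target, on='CustomerID')
print(merged.shape)  # (2, 4)
```

Because every frame has exactly one row per customer, the inner joins preserve the row count while widening the columns.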

In [10]:
# print the shapes of the dataframes
print('Shape of train_demographic:', train_demographic.shape)
print('Shape of train_policy:', train_policy.shape)
print('Shape of train_claim:', train_claim.shape)
print('Shape of train_vehicle:', train_vehicle.shape)
print('Shape of train_target:', train_target.shape)
Shape of train_demographic: (28836, 10)
Shape of train_policy: (28836, 10)
Shape of train_claim: (28836, 19)
Shape of train_vehicle: (28836, 5)
Shape of train_target: (28836, 2)
In [11]:
# merge the dataframes based on the CustomerID column
merged_df = pd.merge(train_demographic, train_policy, on='CustomerID')
merged_df = pd.merge(merged_df, train_claim, on='CustomerID')
merged_df = pd.merge(merged_df, train_vehicle, on='CustomerID')
train_df = pd.merge(merged_df, train_target, on='CustomerID')
display(train_df.head(10))
print("\n")
print('Shape of train_df:', train_df.shape)
CustomerID InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss Country InsurancePolicyNumber CustomerLoyaltyPeriod DateOfPolicyCoverage InsurancePolicyState Policy_CombinedSingleLimit Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship DateOfIncident TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleID VehicleMake VehicleModel VehicleYOM ReportedFraud
0 Cust10000 35 454776 MALE JD armed-forces movies 56700 -48500 India 119121 49 25-10-1998 State1 100/300 1000 1632.73 0 not-in-family 03-02-2015 Multi-vehicle Collision Side Collision Total Loss Police State7 City1 Location 1311 17.0 3 NaN 1 0.0 NaN 65501.0 13417 6071 46013 Vehicle26917 Audi A5 2008 N
1 Cust10001 36 454776 MALE JD tech-support cross-fit 70600 -48500 India 119122 114 15-11-2000 State1 100/300 1000 1255.19 0 not-in-family 02-02-2015 Multi-vehicle Collision Side Collision Total Loss Police State7 City5 Location 1311 10.0 3 YES 2 1.0 YES 61382.0 15560 5919 39903 Vehicle15893 Audi A5 2006 N
2 Cust10002 33 603260 MALE JD armed-forces polo 66400 -63700 India 119123 167 12-02-2001 State3 500/1000 617 1373.38 0 wife 15-01-2015 Single Vehicle Collision Side Collision Minor Damage Other State8 City6 Location 2081 22.0 1 YES 2 3.0 NO 66755.0 11630 11630 43495 Vehicle5152 Volkswagen Jetta 1999 N
3 Cust10003 36 474848 MALE JD armed-forces polo 47900 -73400 India 119124 190 11-04-2005 State2 500/1000 722 1337.60 0 own-child 19-01-2015 Single Vehicle Collision Side Collision Minor Damage Other State9 City6 Location 2081 22.0 1 YES 2 3.0 NO 66243.0 12003 12003 42237 Vehicle37363 Volkswagen Jetta 2003 N
4 Cust10004 29 457942 FEMALE High School exec-managerial dancing 0 -41500 India 119125 115 25-10-1996 State2 100/300 500 1353.73 4279863 unmarried 09-01-2015 Single Vehicle Collision Rear Collision Minor Damage Fire State8 City6 Location 1695 10.0 1 NO 2 1.0 YES 53544.0 8829 7234 37481 Vehicle28633 Toyota CRV 2010 N
5 Cust10005 28 457942 FEMALE High School exec-managerial dancing 0 -41500 India 119126 101 24-10-1999 State2 100/300 500 1334.49 3921366 unmarried 07-02-2015 Single Vehicle Collision Rear Collision Minor Damage Fire State7 City6 Location 1695 7.0 1 NO 1 2.0 NaN 53167.0 7818 8132 37217 Vehicle26409 Toyota CRV 2011 N
6 Cust10006 57 476456 MALE Masters adm-clerical sleeping 67400 0 India 119127 471 18-02-1995 State3 100/300 512 1214.78 165819 own-child 30-01-2015 Single Vehicle Collision Front Collision Minor Damage Ambulance State5 City4 Location 1440 20.0 1 NaN 0 2.0 NO 77453.0 6476 12822 58155 Vehicle12114 Mercedes C300 2000 N
7 Cust10007 49 476456 MALE Masters adm-clerical sleeping 67400 0 India 119128 340 22-02-1993 State3 100/300 877 1159.81 5282219 own-child 12-01-2015 Single Vehicle Collision Front Collision Minor Damage Police State5 City3 Location 1440 18.0 1 NaN 0 2.0 NO 60569.0 5738 7333 47498 Vehicle26987 Suburu C300 2010 N
8 Cust10009 27 432896 FEMALE High School handlers-cleaners camping 56400 -32800 India 119130 81 09-05-1998 State2 500/1000 2000 989.53 0 own-child 06-02-2015 Multi-vehicle Collision Front Collision Minor Damage Ambulance State9 City2 Location 1521 3.0 3 YES 0 0.0 NaN 67876.0 6788 7504 53584 Vehicle12490 Volkswagen Passat 1995 N
9 Cust1001 48 466132 MALE MD craft-repair sleeping 53300 0 India 110122 328 17-10-2014 State3 250/500 1000 1406.91 0 husband 25-01-2015 Single Vehicle Collision Side Collision Major Damage Police State7 City2 Location 1596 5.0 1 YES 1 2.0 YES 71610.0 6510 13020 52080 Vehicle28516 Saab 92x 2004 Y
Shape of train_df: (28836, 42)

3.2 Loading Test Data

We are provided with 5 different files for Test Data.

Files are:

  • Test_Demographics.csv
  • Test_Policy.csv
  • Test_Claim.csv
  • Test_Vehicle.csv
  • Test.csv

Note:

  • The data comes with special missing-value notations, as mentioned in the attribute descriptions above.
In [12]:
test_demographic = pd.read_csv("../Test Data/Test_Demographics.csv",na_values=['NA'])
test_policy = pd.read_csv("../Test Data/Test_Policy.csv",na_values=['NA', '-1', 'MISSINGVAL'])
test_claim = pd.read_csv("../Test Data/Test_Claim.csv",na_values=['?', '-5', 'MISSINGVALUE', 'MISSEDDATA'])
test_vehicle = pd.read_csv("../Test Data/Test_Vehicle.csv" ,na_values=['???'])
test_target = pd.read_csv("../Test Data/Test.csv")
In [13]:
# print the shapes of the dataframes
print('Shape of test_demographic:', test_demographic.shape)
print('Shape of test_policy:', test_policy.shape)
print('Shape of test_claim:', test_claim.shape)
print('Shape of test_vehicle:', test_vehicle.shape)
print('Shape of test_target:', test_target.shape)
Shape of test_demographic: (8912, 10)
Shape of test_policy: (8912, 10)
Shape of test_claim: (8912, 19)
Shape of test_vehicle: (35648, 3)
Shape of test_target: (8912, 1)
In [14]:
# print the columns of the dataframes
print('columns of test_demographic:', test_demographic.columns)
print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
print('columns of test_policy:', test_policy.columns)
print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
print('columns of test_claim:', test_claim.columns)
print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
print('columns of test_vehicle:', test_vehicle.columns)
print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
print('columns of test_target:', test_target.columns)
columns of test_demographic: Index(['CustomerID', 'InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'Country'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
columns of test_policy: Index(['InsurancePolicyNumber', 'CustomerLoyaltyPeriod', 'DateOfPolicyCoverage', 'InsurancePolicyState', 'Policy_CombinedSingleLimit', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'CustomerID'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
columns of test_claim: Index(['CustomerID', 'DateOfIncident', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
columns of test_vehicle: Index(['CustomerID', 'VehicleAttribute', 'VehicleAttributeDetails'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
columns of test_target: Index(['CustomerID'], dtype='object')

3.2.1 Reshaping the "test_vehicle" DataFrame

In this dataframe, every "CustomerID" appears on four rows, one per attribute. We need to pivot the unique values of the "VehicleAttribute" column into individual columns, filled with the corresponding values from the "VehicleAttributeDetails" column.

In [15]:
# pivot the dataframe to get unique values of VehicleAttribute as columns and VehicleAttributeDetails as the values
test_vehicle = test_vehicle.pivot_table(index='CustomerID', columns='VehicleAttribute', values='VehicleAttributeDetails', aggfunc='first')
test_vehicle.head()
Out[15]:
VehicleAttribute VehicleID VehicleMake VehicleModel VehicleYOM
CustomerID
Cust10008 Vehicle34362 Volkswagen Passat 1995
Cust10010 Vehicle17046 Nissan Ultima 2006
Cust10015 Vehicle11038 Suburu Impreza 2010
Cust10020 Vehicle37114 Accura TL 2009
Cust1003 Vehicle16771 Dodge RAM 2007
In [16]:
# remove the 'VehicleAttribute' label from the columns axis
test_vehicle = test_vehicle.rename_axis(None, axis=1)
# reset the index to convert the pivot table to a regular dataframe
test_vehicle = test_vehicle.reset_index()
# print the resulting head of the dataframe and its shape
display(test_vehicle.head(10))
print("\n")
print('Shape of test_vehicle:', test_vehicle.shape)
CustomerID VehicleID VehicleMake VehicleModel VehicleYOM
0 Cust10008 Vehicle34362 Volkswagen Passat 1995
1 Cust10010 Vehicle17046 Nissan Ultima 2006
2 Cust10015 Vehicle11038 Suburu Impreza 2010
3 Cust10020 Vehicle37114 Accura TL 2009
4 Cust1003 Vehicle16771 Dodge RAM 2007
5 Cust10033 Vehicle32962 Suburu Forrestor 2010
6 Cust10036 Vehicle29431 Dodge Neon 2012
7 Cust10038 Vehicle29796 Toyota Corolla 2006
8 Cust10039 Vehicle36221 Toyota Corolla 2007
9 Cust10045 Vehicle24190 Dodge Malibu 2006
Shape of test_vehicle: (8912, 5)

3.2.2 Merging the test DataFrames

We join all the datasets into a single DataFrame named "test_df" using the merge function from the pandas library.

In [17]:
# print the shapes of the dataframes
print('Shape of test_demographic:', test_demographic.shape)
print('Shape of test_policy:', test_policy.shape)
print('Shape of test_claim:', test_claim.shape)
print('Shape of test_vehicle:', test_vehicle.shape)
print('Shape of test_target:', test_target.shape)
Shape of test_demographic: (8912, 10)
Shape of test_policy: (8912, 10)
Shape of test_claim: (8912, 19)
Shape of test_vehicle: (8912, 5)
Shape of test_target: (8912, 1)
In [18]:
# merge the dataframes based on the CustomerID column
merged_df = pd.merge(test_demographic, test_policy, on='CustomerID')
merged_df = pd.merge(merged_df, test_claim, on='CustomerID')
merged_df = pd.merge(merged_df, test_vehicle, on='CustomerID')
test_df = pd.merge(merged_df, test_target, on='CustomerID')
display(test_df.head(10))
print("\n")
print('Shape of test_df:', test_df.shape)
CustomerID InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss Country InsurancePolicyNumber CustomerLoyaltyPeriod DateOfPolicyCoverage InsurancePolicyState Policy_CombinedSingleLimit Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship DateOfIncident TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleID VehicleMake VehicleModel VehicleYOM
0 Cust10008 27 471704 FEMALE High School adm-clerical base-jumping 56400 -57000 India 119129 84 1998-05-10 State2 500/1000 2000 1006.00 0 own-child 2015-02-05 Multi-vehicle Collision Front Collision Minor Damage Ambulance State5 City2 Location 1354 4.0 3 NO 0 0.0 NaN 68354.0 6835 8059 53460 Vehicle34362 Volkswagen Passat 1995
1 Cust10010 40 455810 FEMALE MD prof-specialty golf 56700 -65600 India 119131 232 2011-11-10 State3 100/300 500 1279.17 0 unmarried 2015-01-13 Single Vehicle Collision Rear Collision Minor Damage Fire State9 City5 Location 1383 16.0 1 NaN 1 1.0 NaN 55270.0 8113 5240 41917 Vehicle17046 Nissan Ultima 2006
2 Cust10015 39 461919 MALE JD other-service movies 30400 0 India 119136 218 2010-07-17 State2 250/500 1000 1454.67 1235986 other-relative 2015-01-05 Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 2030 20.0 3 NO 0 1.0 NaN 59515.0 7490 9110 42915 Vehicle11038 Suburu Impreza 2010
3 Cust10020 38 600904 FEMALE Masters exec-managerial video-games 68500 0 India 119141 205 2000-09-10 State3 500/500 2000 1287.76 5873212 wife 2015-01-03 Vehicle Theft NaN Trivial Damage None State7 City5 Location 1449 10.0 1 NaN 2 1.0 NaN 4941.0 494 866 3581 Vehicle37114 Accura TL 2009
4 Cust1003 29 430632 FEMALE PhD sales board-games 35100 0 India 110124 134 2000-09-06 State3 100/300 2000 1413.14 5000000 own-child 2015-02-22 Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 1916 7.0 3 NO 2 3.0 NO 34650.0 7700 3850 23100 Vehicle16771 Dodge RAM 2007
5 Cust10033 55 445339 MALE College tech-support sleeping 45700 0 India 119154 404 1994-03-17 State1 500/1000 2000 933.49 1435220 unmarried 2015-02-06 Multi-vehicle Collision Side Collision Minor Damage Police State5 City4 Location 1359 22.0 3 NaN 1 1.0 YES 82151.0 14692 9208 58251 Vehicle32962 Suburu Forrestor 2010
6 Cust10036 28 442142 FEMALE College craft-repair golf 0 -85900 India 119157 76 2005-10-05 State3 500/1000 766 1382.17 0 wife 2015-01-18 Single Vehicle Collision Rear Collision Major Damage Ambulance State7 City1 Location 1781 6.0 1 NaN 0 1.0 NaN 48775.0 5076 5076 38623 Vehicle29431 Dodge Neon 2012
7 Cust10038 39 460760 MALE JD other-service golf 0 -79400 India 119159 231 1996-05-19 State2 100/300 819 1361.74 1275838 not-in-family 2015-01-24 Single Vehicle Collision Side Collision Total Loss Fire State8 City3 Location 1158 14.0 1 NaN 0 3.0 NaN 51168.0 6084 9266 35818 Vehicle29796 Toyota Corolla 2006
8 Cust10039 33 460760 MALE JD priv-house-serv golf 63100 -79800 India 119160 147 2004-07-17 State2 100/300 1321 1318.66 3282427 not-in-family 2015-02-26 Single Vehicle Collision Side Collision Total Loss Fire State8 City3 Location 1158 15.0 1 NO 1 2.0 NaN 48761.0 7365 7263 34133 Vehicle36221 Toyota Corolla 2007
9 Cust10045 42 605258 FEMALE PhD adm-clerical reading 64800 -22300 India 119166 196 2001-11-28 State1 500/1000 1092 1136.23 0 other-relative 2015-01-09 Single Vehicle Collision Side Collision Minor Damage Police State7 City7 Location 1629 19.0 1 YES 1 3.0 NaN 77860.0 13554 10090 54216 Vehicle24190 Dodge Malibu 2006
Shape of test_df: (8912, 41)

4. Data Preprocessing

In this step, we remove or correct invalid, inaccurate, or irrelevant data. This includes identifying and handling missing data, removing duplicates, correcting errors, and converting columns to their appropriate data types.

Steps for Data Pre-Processing:

  1. Decide numerical and categorical columns and change their respective data types.
  2. Check for missing values and duplicate values in the data.
  3. Check for cardinality in the columns.
  4. Decide X (Independent attributes or predictors) & y (Dependent variable).
  5. Do the Train-Test split.
     a. train_test_split
     b. **Outcome:**
             * X_train, X_test, y_train, y_test
  6. Separate categorical columns & numerical columns (X_train, X_test)
     a. **Outcome:**
             * X_train_cat, X_train_num, X_test_cat, X_test_num
  7. For Categorical columns:
     a. Impute missing values (Mode)
         * Fit on X_train_cat
         * Transform on X_train_cat & X_test_cat
     b. Label Encoding
         * Only for columns that have ordinality, and for the target column.
     c. One-hot Encoding
         * Only for columns with no ordinality.
         * Handles unseen categories more gracefully than get_dummies.
         * Can do fit_transform directly.
  8. For Numerical columns:
     a. Impute missing values (Mean, Median)
         * Fit on X_train_num
         * Transform on X_train_num & X_test_num
     b. Standardization:
         * Fit on X_train_num
         * Transform on X_train_num & X_test_num
     c. Normalization:
         * Fit on X_train_num
         * Transform on X_train_num & X_test_num
  9. Reset the Index.
  10. Combine numerical data & categorical data.
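Steps 7a and 8a above share one discipline: imputers are fit on the training split only and then applied to both splits, so no information leaks from test to train. A minimal sketch, assuming scikit-learn is available and using an invented single-column example:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented train/test splits with missing values in a numerical column.
X_train_num = pd.DataFrame({'PolicyAnnualPremium': [1200.0, np.nan, 1400.0]})
X_test_num  = pd.DataFrame({'PolicyAnnualPremium': [np.nan, 1100.0]})

imputer = SimpleImputer(strategy='median')
imputer.fit(X_train_num)                        # learn the median from train only
X_train_num[:] = imputer.transform(X_train_num)
X_test_num[:]  = imputer.transform(X_test_num)  # test receives the train median
print(X_test_num.iloc[0, 0])                    # 1300.0 (median of 1200 and 1400)
```

The same fit-on-train, transform-on-both pattern applies to the mode imputer for categorical columns and to the standardization and normalization scalers.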

4.1 Data Cleaning

We do the following steps in this part.

Steps:

  • Splitting the "Policy_CombinedSingleLimit" column
  • Dropping columns that contribute to poor model performance
  • Checking whether there are any duplicate records in the data and, if so, removing them
  • Handling missing values present in the data

Note:

  • These steps are performed for both the Train Data and the Test Data.
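The duplicate check named in the steps above amounts to the following sketch (toy frame with invented rows):

```python
import pandas as pd

# One exact duplicate row planted for illustration.
df = pd.DataFrame({'CustomerID': ['C1', 'C2', 'C2'],
                   'Witnesses':  [1, 3, 3]})

print('duplicate rows:', df.duplicated().sum())  # 1
df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)  # (2, 2)
```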

4.1.1 Splitting "Policy_CombinedSingleLimit" column

In this step, we split the "Policy_CombinedSingleLimit" column into individual "SplitLimit" and "CombinedSingleLimit" columns and drop the original column.

4.1.1.1 Train Data

In [19]:
display(train_df.head())
CustomerID InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss Country InsurancePolicyNumber CustomerLoyaltyPeriod DateOfPolicyCoverage InsurancePolicyState Policy_CombinedSingleLimit Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship DateOfIncident TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleID VehicleMake VehicleModel VehicleYOM ReportedFraud
0 Cust10000 35 454776 MALE JD armed-forces movies 56700 -48500 India 119121 49 25-10-1998 State1 100/300 1000 1632.73 0 not-in-family 03-02-2015 Multi-vehicle Collision Side Collision Total Loss Police State7 City1 Location 1311 17.0 3 NaN 1 0.0 NaN 65501.0 13417 6071 46013 Vehicle26917 Audi A5 2008 N
1 Cust10001 36 454776 MALE JD tech-support cross-fit 70600 -48500 India 119122 114 15-11-2000 State1 100/300 1000 1255.19 0 not-in-family 02-02-2015 Multi-vehicle Collision Side Collision Total Loss Police State7 City5 Location 1311 10.0 3 YES 2 1.0 YES 61382.0 15560 5919 39903 Vehicle15893 Audi A5 2006 N
2 Cust10002 33 603260 MALE JD armed-forces polo 66400 -63700 India 119123 167 12-02-2001 State3 500/1000 617 1373.38 0 wife 15-01-2015 Single Vehicle Collision Side Collision Minor Damage Other State8 City6 Location 2081 22.0 1 YES 2 3.0 NO 66755.0 11630 11630 43495 Vehicle5152 Volkswagen Jetta 1999 N
3 Cust10003 36 474848 MALE JD armed-forces polo 47900 -73400 India 119124 190 11-04-2005 State2 500/1000 722 1337.60 0 own-child 19-01-2015 Single Vehicle Collision Side Collision Minor Damage Other State9 City6 Location 2081 22.0 1 YES 2 3.0 NO 66243.0 12003 12003 42237 Vehicle37363 Volkswagen Jetta 2003 N
4 Cust10004 29 457942 FEMALE High School exec-managerial dancing 0 -41500 India 119125 115 25-10-1996 State2 100/300 500 1353.73 4279863 unmarried 09-01-2015 Single Vehicle Collision Rear Collision Minor Damage Fire State8 City6 Location 1695 10.0 1 NO 2 1.0 YES 53544.0 8829 7234 37481 Vehicle28633 Toyota CRV 2010 N
In [20]:
# create two new columns by splitting the values column
train_df[['SplitLimit', 'CombinedSingleLimit']] = train_df['Policy_CombinedSingleLimit'].str.split('/', expand=True)

# convert the columns to the appropriate data types if necessary
train_df['SplitLimit'] = train_df['SplitLimit'].astype(int)
train_df['CombinedSingleLimit'] = train_df['CombinedSingleLimit'].astype(int)

# dropping the original column
train_df.drop("Policy_CombinedSingleLimit", axis=1, inplace=True)
In [21]:
display(train_df.head())
CustomerID InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss Country InsurancePolicyNumber CustomerLoyaltyPeriod DateOfPolicyCoverage InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship DateOfIncident TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleID VehicleMake VehicleModel VehicleYOM ReportedFraud SplitLimit CombinedSingleLimit
0 Cust10000 35 454776 MALE JD armed-forces movies 56700 -48500 India 119121 49 25-10-1998 State1 1000 1632.73 0 not-in-family 03-02-2015 Multi-vehicle Collision Side Collision Total Loss Police State7 City1 Location 1311 17.0 3 NaN 1 0.0 NaN 65501.0 13417 6071 46013 Vehicle26917 Audi A5 2008 N 100 300
1 Cust10001 36 454776 MALE JD tech-support cross-fit 70600 -48500 India 119122 114 15-11-2000 State1 1000 1255.19 0 not-in-family 02-02-2015 Multi-vehicle Collision Side Collision Total Loss Police State7 City5 Location 1311 10.0 3 YES 2 1.0 YES 61382.0 15560 5919 39903 Vehicle15893 Audi A5 2006 N 100 300
2 Cust10002 33 603260 MALE JD armed-forces polo 66400 -63700 India 119123 167 12-02-2001 State3 617 1373.38 0 wife 15-01-2015 Single Vehicle Collision Side Collision Minor Damage Other State8 City6 Location 2081 22.0 1 YES 2 3.0 NO 66755.0 11630 11630 43495 Vehicle5152 Volkswagen Jetta 1999 N 500 1000
3 Cust10003 36 474848 MALE JD armed-forces polo 47900 -73400 India 119124 190 11-04-2005 State2 722 1337.60 0 own-child 19-01-2015 Single Vehicle Collision Side Collision Minor Damage Other State9 City6 Location 2081 22.0 1 YES 2 3.0 NO 66243.0 12003 12003 42237 Vehicle37363 Volkswagen Jetta 2003 N 500 1000
4 Cust10004 29 457942 FEMALE High School exec-managerial dancing 0 -41500 India 119125 115 25-10-1996 State2 500 1353.73 4279863 unmarried 09-01-2015 Single Vehicle Collision Rear Collision Minor Damage Fire State8 City6 Location 1695 10.0 1 NO 2 1.0 YES 53544.0 8829 7234 37481 Vehicle28633 Toyota CRV 2010 N 100 300
In [22]:
print('Shape of train_df:', train_df.shape)
Shape of train_df: (28836, 43)

4.1.1.2 Test Data

In [23]:
display(test_df.head())
CustomerID InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss Country InsurancePolicyNumber CustomerLoyaltyPeriod DateOfPolicyCoverage InsurancePolicyState Policy_CombinedSingleLimit Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship DateOfIncident TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleID VehicleMake VehicleModel VehicleYOM
0 Cust10008 27 471704 FEMALE High School adm-clerical base-jumping 56400 -57000 India 119129 84 1998-05-10 State2 500/1000 2000 1006.00 0 own-child 2015-02-05 Multi-vehicle Collision Front Collision Minor Damage Ambulance State5 City2 Location 1354 4.0 3 NO 0 0.0 NaN 68354.0 6835 8059 53460 Vehicle34362 Volkswagen Passat 1995
1 Cust10010 40 455810 FEMALE MD prof-specialty golf 56700 -65600 India 119131 232 2011-11-10 State3 100/300 500 1279.17 0 unmarried 2015-01-13 Single Vehicle Collision Rear Collision Minor Damage Fire State9 City5 Location 1383 16.0 1 NaN 1 1.0 NaN 55270.0 8113 5240 41917 Vehicle17046 Nissan Ultima 2006
2 Cust10015 39 461919 MALE JD other-service movies 30400 0 India 119136 218 2010-07-17 State2 250/500 1000 1454.67 1235986 other-relative 2015-01-05 Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 2030 20.0 3 NO 0 1.0 NaN 59515.0 7490 9110 42915 Vehicle11038 Suburu Impreza 2010
3 Cust10020 38 600904 FEMALE Masters exec-managerial video-games 68500 0 India 119141 205 2000-09-10 State3 500/500 2000 1287.76 5873212 wife 2015-01-03 Vehicle Theft NaN Trivial Damage None State7 City5 Location 1449 10.0 1 NaN 2 1.0 NaN 4941.0 494 866 3581 Vehicle37114 Accura TL 2009
4 Cust1003 29 430632 FEMALE PhD sales board-games 35100 0 India 110124 134 2000-09-06 State3 100/300 2000 1413.14 5000000 own-child 2015-02-22 Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 1916 7.0 3 NO 2 3.0 NO 34650.0 7700 3850 23100 Vehicle16771 Dodge RAM 2007
In [24]:
# create two new columns by splitting the Policy_CombinedSingleLimit column
test_df[['SplitLimit', 'CombinedSingleLimit']] = test_df['Policy_CombinedSingleLimit'].str.split('/', expand=True)

# convert the columns to the appropriate data types if necessary
test_df['SplitLimit'] = test_df['SplitLimit'].astype(int)
test_df['CombinedSingleLimit'] = test_df['CombinedSingleLimit'].astype(int)

# dropping the original column
test_df.drop("Policy_CombinedSingleLimit", axis=1, inplace=True)
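The split-and-convert logic above can be checked on a toy Series first (the values here are illustrative, mimicking the `Policy_CombinedSingleLimit` format, not rows from the actual dataset):

```python
import pandas as pd

# A small Series in the same 'split/combined' format as Policy_CombinedSingleLimit
limits = pd.Series(['100/300', '250/500', '500/1000'])

# expand=True returns a DataFrame with one column per split part
parts = limits.str.split('/', expand=True).astype(int)
parts.columns = ['SplitLimit', 'CombinedSingleLimit']

print(parts['SplitLimit'].tolist())           # [100, 250, 500]
print(parts['CombinedSingleLimit'].tolist())  # [300, 500, 1000]
```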
In [25]:
display(test_df.head())
CustomerID InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss Country InsurancePolicyNumber CustomerLoyaltyPeriod DateOfPolicyCoverage InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship DateOfIncident TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleID VehicleMake VehicleModel VehicleYOM SplitLimit CombinedSingleLimit
0 Cust10008 27 471704 FEMALE High School adm-clerical base-jumping 56400 -57000 India 119129 84 1998-05-10 State2 2000 1006.00 0 own-child 2015-02-05 Multi-vehicle Collision Front Collision Minor Damage Ambulance State5 City2 Location 1354 4.0 3 NO 0 0.0 NaN 68354.0 6835 8059 53460 Vehicle34362 Volkswagen Passat 1995 500 1000
1 Cust10010 40 455810 FEMALE MD prof-specialty golf 56700 -65600 India 119131 232 2011-11-10 State3 500 1279.17 0 unmarried 2015-01-13 Single Vehicle Collision Rear Collision Minor Damage Fire State9 City5 Location 1383 16.0 1 NaN 1 1.0 NaN 55270.0 8113 5240 41917 Vehicle17046 Nissan Ultima 2006 100 300
2 Cust10015 39 461919 MALE JD other-service movies 30400 0 India 119136 218 2010-07-17 State2 1000 1454.67 1235986 other-relative 2015-01-05 Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 2030 20.0 3 NO 0 1.0 NaN 59515.0 7490 9110 42915 Vehicle11038 Suburu Impreza 2010 250 500
3 Cust10020 38 600904 FEMALE Masters exec-managerial video-games 68500 0 India 119141 205 2000-09-10 State3 2000 1287.76 5873212 wife 2015-01-03 Vehicle Theft NaN Trivial Damage None State7 City5 Location 1449 10.0 1 NaN 2 1.0 NaN 4941.0 494 866 3581 Vehicle37114 Accura TL 2009 500 500
4 Cust1003 29 430632 FEMALE PhD sales board-games 35100 0 India 110124 134 2000-09-06 State3 2000 1413.14 5000000 own-child 2015-02-22 Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 1916 7.0 3 NO 2 3.0 NO 34650.0 7700 3850 23100 Vehicle16771 Dodge RAM 2007 100 300
In [26]:
print('Shape of test_df:', test_df.shape)
Shape of test_df: (8912, 42)

4.1.2 Changing Data-types

In this step, we convert the columns to their appropriate data-types, as listed below.

Columns and their data-type:

  • CustomerID: Category
  • InsuredAge: int
  • InsuredZipCode: Category
  • InsuredGender: Category
  • InsuredEducationLevel: Category
  • InsuredOccupation: Category
  • InsuredHobbies: Category
  • CapitalGains: int
  • CapitalLoss: int
  • Country: Category
  • InsurancePolicyNumber: Category
  • CustomerLoyaltyPeriod: int
  • DateOfPolicyCoverage: date
  • InsurancePolicyState: Category
  • Policy_Deductible: int
  • PolicyAnnualPremium: float
  • UmbrellaLimit: int
  • InsuredRelationship: Category
  • DateOfIncident: date
  • TypeOfIncident: Category
  • TypeOfCollission: Category
  • SeverityOfIncident: Category
  • AuthoritiesContacted: Category
  • IncidentState: Category
  • IncidentCity: Category
  • IncidentAddress: Category
  • IncidentTime: float
  • NumberOfVehicles: int
  • PropertyDamage: Category
  • BodilyInjuries: int
  • Witnesses: int
  • PoliceReport: Category
  • AmountOfTotalClaim: float
  • AmountOfInjuryClaim: int
  • AmountOfPropertyClaim: int
  • AmountOfVehicleDamage: int
  • VehicleID: Category
  • VehicleMake: Category
  • VehicleModel: Category
  • VehicleYOM: date
  • ReportedFraud: Category
  • SplitLimit: int
  • CombinedSingleLimit: int
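As a quick illustration of why low-cardinality string columns get the `category` dtype, here is a minimal sketch on toy data (not the project dataset): the distinct labels are stored once and each row holds only a small integer code, which saves memory and speeds up grouping.

```python
import pandas as pd

# A repetitive string column, similar in shape to InsuredGender
s_obj = pd.Series(['MALE', 'FEMALE'] * 5000)   # object dtype
s_cat = s_obj.astype('category')               # category dtype

# The categorical version stores the labels once plus integer codes per row
print(s_obj.memory_usage(deep=True) > s_cat.memory_usage(deep=True))  # True
print(list(s_cat.cat.categories))  # ['FEMALE', 'MALE']
```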

4.1.2.1 Train Data

In [27]:
train_df.dtypes
Out[27]:
CustomerID                object
InsuredAge                 int64
InsuredZipCode             int64
InsuredGender             object
InsuredEducationLevel     object
InsuredOccupation         object
InsuredHobbies            object
CapitalGains               int64
CapitalLoss                int64
Country                   object
InsurancePolicyNumber      int64
CustomerLoyaltyPeriod      int64
DateOfPolicyCoverage      object
InsurancePolicyState      object
Policy_Deductible          int64
PolicyAnnualPremium      float64
UmbrellaLimit              int64
InsuredRelationship       object
DateOfIncident            object
TypeOfIncident            object
TypeOfCollission          object
SeverityOfIncident        object
AuthoritiesContacted      object
IncidentState             object
IncidentCity              object
IncidentAddress           object
IncidentTime             float64
NumberOfVehicles           int64
PropertyDamage            object
BodilyInjuries             int64
Witnesses                float64
PoliceReport              object
AmountOfTotalClaim       float64
AmountOfInjuryClaim        int64
AmountOfPropertyClaim      int64
AmountOfVehicleDamage      int64
VehicleID                 object
VehicleMake               object
VehicleModel              object
VehicleYOM                object
ReportedFraud             object
SplitLimit                 int64
CombinedSingleLimit        int64
dtype: object
In [28]:
# function to convert the given columns of a dataframe into the mentioned data-type
def convert_columns_types_to_category(DataFrame, cols=None, col_type=None):
    display('### Before conversion: ###', DataFrame.dtypes) # checking the data types of the columns before converting
    DataFrame[cols] = DataFrame[cols].astype(col_type) # changing the data types using the astype() function
    display('### After conversion: ###', DataFrame.dtypes) # checking the data types of the columns after converting
    return DataFrame
In [29]:
# categorical conversion

# storing the categorical columns in the cat_cols variable
cat_cols = ['CustomerID', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel',
           'InsuredOccupation', 'InsuredHobbies', 'Country', 'InsurancePolicyNumber',
           'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission',
           'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity',
           'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleID',
            'VehicleMake', 'VehicleModel', 'ReportedFraud']

# calling the convert_columns_types_to_category() function defined above
train_df = convert_columns_types_to_category(train_df, cols=cat_cols, col_type = 'category')
'### Before conversion: ###'
CustomerID                object
InsuredAge                 int64
InsuredZipCode             int64
InsuredGender             object
InsuredEducationLevel     object
InsuredOccupation         object
InsuredHobbies            object
CapitalGains               int64
CapitalLoss                int64
Country                   object
InsurancePolicyNumber      int64
CustomerLoyaltyPeriod      int64
DateOfPolicyCoverage      object
InsurancePolicyState      object
Policy_Deductible          int64
PolicyAnnualPremium      float64
UmbrellaLimit              int64
InsuredRelationship       object
DateOfIncident            object
TypeOfIncident            object
TypeOfCollission          object
SeverityOfIncident        object
AuthoritiesContacted      object
IncidentState             object
IncidentCity              object
IncidentAddress           object
IncidentTime             float64
NumberOfVehicles           int64
PropertyDamage            object
BodilyInjuries             int64
Witnesses                float64
PoliceReport              object
AmountOfTotalClaim       float64
AmountOfInjuryClaim        int64
AmountOfPropertyClaim      int64
AmountOfVehicleDamage      int64
VehicleID                 object
VehicleMake               object
VehicleModel              object
VehicleYOM                object
ReportedFraud             object
SplitLimit                 int64
CombinedSingleLimit        int64
dtype: object
'### After conversion: ###'
CustomerID               category
InsuredAge                  int64
InsuredZipCode           category
InsuredGender            category
InsuredEducationLevel    category
InsuredOccupation        category
InsuredHobbies           category
CapitalGains                int64
CapitalLoss                 int64
Country                  category
InsurancePolicyNumber    category
CustomerLoyaltyPeriod       int64
DateOfPolicyCoverage       object
InsurancePolicyState     category
Policy_Deductible           int64
PolicyAnnualPremium       float64
UmbrellaLimit               int64
InsuredRelationship      category
DateOfIncident             object
TypeOfIncident           category
TypeOfCollission         category
SeverityOfIncident       category
AuthoritiesContacted     category
IncidentState            category
IncidentCity             category
IncidentAddress          category
IncidentTime              float64
NumberOfVehicles            int64
PropertyDamage           category
BodilyInjuries              int64
Witnesses                 float64
PoliceReport             category
AmountOfTotalClaim        float64
AmountOfInjuryClaim         int64
AmountOfPropertyClaim       int64
AmountOfVehicleDamage       int64
VehicleID                category
VehicleMake              category
VehicleModel             category
VehicleYOM                 object
ReportedFraud            category
SplitLimit                  int64
CombinedSingleLimit         int64
dtype: object
In [30]:
# datetime conversion

# using the pandas to_datetime() function, we convert the columns into date-time format
train_df['DateOfPolicyCoverage'] = pd.to_datetime(train_df['DateOfPolicyCoverage'], format='%d-%m-%Y')
train_df['DateOfIncident'] = pd.to_datetime(train_df['DateOfIncident'], format='%d-%m-%Y')
train_df['VehicleYOM'] = pd.to_datetime(train_df['VehicleYOM'], format='%Y')
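Note that the `format` string must match how the dates are actually stored: the train data uses day-month-year strings (e.g. `25-10-1998`), while the test data uses year-month-day. A small sketch with toy date strings:

```python
import pandas as pd

# Day-month-year strings, as in the train data's DateOfPolicyCoverage
d = pd.to_datetime(pd.Series(['25-10-1998', '03-02-2015']), format='%d-%m-%Y')
print(d.dt.year.tolist())   # [1998, 2015]

# A bare year (as in VehicleYOM) parses to January 1st of that year
y = pd.to_datetime(pd.Series(['2008']), format='%Y')
print(y[0].month)           # 1
```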
In [31]:
train_df.dtypes
Out[31]:
CustomerID                     category
InsuredAge                        int64
InsuredZipCode                 category
InsuredGender                  category
InsuredEducationLevel          category
InsuredOccupation              category
InsuredHobbies                 category
CapitalGains                      int64
CapitalLoss                       int64
Country                        category
InsurancePolicyNumber          category
CustomerLoyaltyPeriod             int64
DateOfPolicyCoverage     datetime64[ns]
InsurancePolicyState           category
Policy_Deductible                 int64
PolicyAnnualPremium             float64
UmbrellaLimit                     int64
InsuredRelationship            category
DateOfIncident           datetime64[ns]
TypeOfIncident                 category
TypeOfCollission               category
SeverityOfIncident             category
AuthoritiesContacted           category
IncidentState                  category
IncidentCity                   category
IncidentAddress                category
IncidentTime                    float64
NumberOfVehicles                  int64
PropertyDamage                 category
BodilyInjuries                    int64
Witnesses                       float64
PoliceReport                   category
AmountOfTotalClaim              float64
AmountOfInjuryClaim               int64
AmountOfPropertyClaim             int64
AmountOfVehicleDamage             int64
VehicleID                      category
VehicleMake                    category
VehicleModel                   category
VehicleYOM               datetime64[ns]
ReportedFraud                  category
SplitLimit                        int64
CombinedSingleLimit               int64
dtype: object

4.1.2.2 Test Data

Now that the data-types of the train data have been converted, we can follow the same steps for the test data.

In [32]:
test_df.dtypes
Out[32]:
CustomerID                object
InsuredAge                 int64
InsuredZipCode             int64
InsuredGender             object
InsuredEducationLevel     object
InsuredOccupation         object
InsuredHobbies            object
CapitalGains               int64
CapitalLoss                int64
Country                   object
InsurancePolicyNumber      int64
CustomerLoyaltyPeriod      int64
DateOfPolicyCoverage      object
InsurancePolicyState      object
Policy_Deductible          int64
PolicyAnnualPremium      float64
UmbrellaLimit              int64
InsuredRelationship       object
DateOfIncident            object
TypeOfIncident            object
TypeOfCollission          object
SeverityOfIncident        object
AuthoritiesContacted      object
IncidentState             object
IncidentCity              object
IncidentAddress           object
IncidentTime             float64
NumberOfVehicles           int64
PropertyDamage            object
BodilyInjuries             int64
Witnesses                float64
PoliceReport              object
AmountOfTotalClaim       float64
AmountOfInjuryClaim        int64
AmountOfPropertyClaim      int64
AmountOfVehicleDamage      int64
VehicleID                 object
VehicleMake               object
VehicleModel              object
VehicleYOM                object
SplitLimit                 int64
CombinedSingleLimit        int64
dtype: object
In [33]:
# categorical conversion

# storing the categorical columns in the cat_cols variable
cat_cols = ['CustomerID', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel',
           'InsuredOccupation', 'InsuredHobbies', 'Country', 'InsurancePolicyNumber',
           'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission',
           'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity',
           'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleID',
            'VehicleMake', 'VehicleModel']

# calling the convert_columns_types_to_category() function defined above
test_df = convert_columns_types_to_category(test_df, cols=cat_cols, col_type = 'category')
'### Before conversion: ###'
CustomerID                object
InsuredAge                 int64
InsuredZipCode             int64
InsuredGender             object
InsuredEducationLevel     object
InsuredOccupation         object
InsuredHobbies            object
CapitalGains               int64
CapitalLoss                int64
Country                   object
InsurancePolicyNumber      int64
CustomerLoyaltyPeriod      int64
DateOfPolicyCoverage      object
InsurancePolicyState      object
Policy_Deductible          int64
PolicyAnnualPremium      float64
UmbrellaLimit              int64
InsuredRelationship       object
DateOfIncident            object
TypeOfIncident            object
TypeOfCollission          object
SeverityOfIncident        object
AuthoritiesContacted      object
IncidentState             object
IncidentCity              object
IncidentAddress           object
IncidentTime             float64
NumberOfVehicles           int64
PropertyDamage            object
BodilyInjuries             int64
Witnesses                float64
PoliceReport              object
AmountOfTotalClaim       float64
AmountOfInjuryClaim        int64
AmountOfPropertyClaim      int64
AmountOfVehicleDamage      int64
VehicleID                 object
VehicleMake               object
VehicleModel              object
VehicleYOM                object
SplitLimit                 int64
CombinedSingleLimit        int64
dtype: object
'### After conversion: ###'
CustomerID               category
InsuredAge                  int64
InsuredZipCode           category
InsuredGender            category
InsuredEducationLevel    category
InsuredOccupation        category
InsuredHobbies           category
CapitalGains                int64
CapitalLoss                 int64
Country                  category
InsurancePolicyNumber    category
CustomerLoyaltyPeriod       int64
DateOfPolicyCoverage       object
InsurancePolicyState     category
Policy_Deductible           int64
PolicyAnnualPremium       float64
UmbrellaLimit               int64
InsuredRelationship      category
DateOfIncident             object
TypeOfIncident           category
TypeOfCollission         category
SeverityOfIncident       category
AuthoritiesContacted     category
IncidentState            category
IncidentCity             category
IncidentAddress          category
IncidentTime              float64
NumberOfVehicles            int64
PropertyDamage           category
BodilyInjuries              int64
Witnesses                 float64
PoliceReport             category
AmountOfTotalClaim        float64
AmountOfInjuryClaim         int64
AmountOfPropertyClaim       int64
AmountOfVehicleDamage       int64
VehicleID                category
VehicleMake              category
VehicleModel             category
VehicleYOM                 object
SplitLimit                  int64
CombinedSingleLimit         int64
dtype: object
In [34]:
# datetime conversion

# using the pandas to_datetime() function, we convert the columns into date-time format
test_df['DateOfPolicyCoverage'] = pd.to_datetime(test_df['DateOfPolicyCoverage'], format='%Y-%m-%d')
test_df['DateOfIncident'] = pd.to_datetime(test_df['DateOfIncident'], format='%Y-%m-%d')
test_df['VehicleYOM'] = pd.to_datetime(test_df['VehicleYOM'], format='%Y')
In [35]:
test_df.dtypes
Out[35]:
CustomerID                     category
InsuredAge                        int64
InsuredZipCode                 category
InsuredGender                  category
InsuredEducationLevel          category
InsuredOccupation              category
InsuredHobbies                 category
CapitalGains                      int64
CapitalLoss                       int64
Country                        category
InsurancePolicyNumber          category
CustomerLoyaltyPeriod             int64
DateOfPolicyCoverage     datetime64[ns]
InsurancePolicyState           category
Policy_Deductible                 int64
PolicyAnnualPremium             float64
UmbrellaLimit                     int64
InsuredRelationship            category
DateOfIncident           datetime64[ns]
TypeOfIncident                 category
TypeOfCollission               category
SeverityOfIncident             category
AuthoritiesContacted           category
IncidentState                  category
IncidentCity                   category
IncidentAddress                category
IncidentTime                    float64
NumberOfVehicles                  int64
PropertyDamage                 category
BodilyInjuries                    int64
Witnesses                       float64
PoliceReport                   category
AmountOfTotalClaim              float64
AmountOfInjuryClaim               int64
AmountOfPropertyClaim             int64
AmountOfVehicleDamage             int64
VehicleID                      category
VehicleMake                    category
VehicleModel                   category
VehicleYOM               datetime64[ns]
SplitLimit                        int64
CombinedSingleLimit               int64
dtype: object

4.1.3 Dropping Unwanted Columns

By looking at the data, we can identify columns that can be dropped even before building the models.

Such columns are:

  • CustomerID: This column contains Customer IDs that are unique to every customer, so it makes no contribution to the model's performance; rather, it increases the dimensionality of the data once encoded, raising the problem of the "curse of dimensionality".
  • Country: This column contains only one value across all records, so it contributes next to nothing to the model's performance.
  • InsurancePolicyNumber: Like CustomerID, this column contains the customers' unique policy numbers, and for the same reasons raised above we can safely drop it too.
  • VehicleID: Dropped for the same reasons mentioned above.
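These judgments can also be confirmed programmatically with `nunique()`: identifier-like columns have one distinct value per row, and constant columns have exactly one. A sketch on toy data (`df` here is illustrative, not `train_df`):

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': ['Cust1', 'Cust2', 'Cust3'],  # unique per row -> identifier-like
    'Country':    ['India', 'India', 'India'],  # one value only -> constant
    'Witnesses':  [1, 3, 1],                    # a genuine feature
})

# Columns where every value is unique (candidates to drop as IDs)
id_like = [c for c in df.columns if df[c].nunique() == len(df)]
# Columns with a single value (candidates to drop as constants)
constant = [c for c in df.columns if df[c].nunique() == 1]

print(id_like)   # ['CustomerID']
print(constant)  # ['Country']
```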

4.1.3.1 Train Data

In [36]:
# function to drop unnecessary columns in a dataframe
def drop_unnecessary_columns(DataFrame, cols=None):
  display('Before Dropping  : ', DataFrame.columns) # printing the column names before dropping the selected columns
  print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
  DataFrame.drop(columns=cols, inplace=True) # dropping the columns
  print("♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦")
  display('After Dropping : ', DataFrame.columns) # printing the column names after dropping the selected columns
  return DataFrame
In [37]:
# storing the columns which are to be dropped in drop_cols
drop_cols = ['CustomerID', 'Country', 'InsurancePolicyNumber', 'VehicleID']

# calling the function defined above to drop the columns
train_df = drop_unnecessary_columns(train_df, cols=drop_cols)
'Before Dropping  : '
Index(['CustomerID', 'InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'Country', 'InsurancePolicyNumber', 'CustomerLoyaltyPeriod', 'DateOfPolicyCoverage', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'DateOfIncident', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleID', 'VehicleMake', 'VehicleModel', 'VehicleYOM', 'ReportedFraud', 'SplitLimit', 'CombinedSingleLimit'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
'After Dropping : '
Index(['InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'DateOfPolicyCoverage', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'DateOfIncident', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleMake', 'VehicleModel', 'VehicleYOM', 'ReportedFraud', 'SplitLimit', 'CombinedSingleLimit'], dtype='object')

4.1.3.2 Test Data

Now that the mentioned columns have been dropped from the train data, we can follow the same steps for the test data.

In [38]:
# storing the columns which are to be dropped in drop_cols
drop_cols = ['CustomerID', 'Country', 'InsurancePolicyNumber', 'VehicleID']

# calling the function defined above to drop the columns
test_df = drop_unnecessary_columns(test_df, cols=drop_cols)
'Before Dropping  : '
Index(['CustomerID', 'InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'Country', 'InsurancePolicyNumber', 'CustomerLoyaltyPeriod', 'DateOfPolicyCoverage', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'DateOfIncident', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleID', 'VehicleMake', 'VehicleModel', 'VehicleYOM', 'SplitLimit', 'CombinedSingleLimit'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
'After Dropping : '
Index(['InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'DateOfPolicyCoverage', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'DateOfIncident', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleMake', 'VehicleModel', 'VehicleYOM', 'SplitLimit', 'CombinedSingleLimit'], dtype='object')

4.1.4 Dealing with Duplicate Values

Let's look at the dataframes to check whether any duplicate records are present.

4.1.4.1 Train Data

In [39]:
# this function checks for duplicate records in the dataframe
def check_duplicates(df):
    # Check for duplicate records
    duplicate_rows = df[df.duplicated()]
    
    # Check if there are any duplicate records
    if duplicate_rows.empty:
        print("There are no duplicate records.")
    else:
        print("Duplicate records:")
        print(duplicate_rows)
In [40]:
check_duplicates(train_df)
There are no duplicate records.

4.1.4.2 Test Data

In [41]:
check_duplicates(test_df)
There are no duplicate records.
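For reference, `DataFrame.duplicated()` flags every repeat of an earlier row (the first occurrence is kept unflagged by default); a toy sketch, unrelated to the claims data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

# The second row repeats the first; duplicated() marks repeats only
print(df.duplicated().tolist())            # [False, True, False]

# keep=False marks every member of a duplicate group
print(df.duplicated(keep=False).tolist())  # [True, True, False]

# drop_duplicates() would remove the flagged repeats
print(len(df.drop_duplicates()))           # 2
```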

4.1.5 Feature Extraction From Dates Column

We extract new features from the date columns in the data.

4.1.5.1 Train Data

In [42]:
# Derive Age of Vehicle on the day of Incident
train_df['VehicleAge'] = (train_df['DateOfIncident'] - train_df['VehicleYOM']).dt.days 

# Derive Day of the Week of Incident
train_df['DayOfWeek'] = train_df['DateOfIncident'].dt.day_name()

# Derive Month of Incident
train_df['MonthOfIncident'] = train_df['DateOfIncident'].dt.month_name()

# Derive Time between Policy Coverage and Incident
train_df['TimeBetweenCoverageAndIncident'] = (train_df['DateOfIncident'] - train_df['DateOfPolicyCoverage']).dt.days
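The date arithmetic above relies on pandas' `.dt` accessor: subtracting two datetime Series yields a Timedelta Series whose `.dt.days` gives integer day counts, and `day_name()`/`month_name()` come from the same accessor. A minimal sketch using the first row's dates (incident on 2015-02-03, vehicle year 2008) reproduces its `VehicleAge` of 2590 days:

```python
import pandas as pd

incident = pd.to_datetime(pd.Series(['2015-02-03']))
yom      = pd.to_datetime(pd.Series(['2008']), format='%Y')  # parses to 2008-01-01

# Subtracting datetime Series yields Timedeltas; .dt.days gives integers
print((incident - yom).dt.days.tolist())   # [2590]

# Day-of-week and month names come from the same .dt accessor
print(incident.dt.day_name().tolist())     # ['Tuesday']
print(incident.dt.month_name().tolist())   # ['February']
```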
In [43]:
train_df.shape
Out[43]:
(28836, 43)
In [44]:
train_df.drop(['DateOfIncident', 'VehicleYOM', 'DateOfPolicyCoverage'], axis=1, inplace=True)
In [45]:
train_df.head()
Out[45]:
InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss CustomerLoyaltyPeriod InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleMake VehicleModel ReportedFraud SplitLimit CombinedSingleLimit VehicleAge DayOfWeek MonthOfIncident TimeBetweenCoverageAndIncident
0 35 454776 MALE JD armed-forces movies 56700 -48500 49 State1 1000 1632.73 0 not-in-family Multi-vehicle Collision Side Collision Total Loss Police State7 City1 Location 1311 17.0 3 NaN 1 0.0 NaN 65501.0 13417 6071 46013 Audi A5 N 100 300 2590 Tuesday February 5945
1 36 454776 MALE JD tech-support cross-fit 70600 -48500 114 State1 1000 1255.19 0 not-in-family Multi-vehicle Collision Side Collision Total Loss Police State7 City5 Location 1311 10.0 3 YES 2 1.0 YES 61382.0 15560 5919 39903 Audi A5 N 100 300 3319 Monday February 5192
2 33 603260 MALE JD armed-forces polo 66400 -63700 167 State3 617 1373.38 0 wife Single Vehicle Collision Side Collision Minor Damage Other State8 City6 Location 2081 22.0 1 YES 2 3.0 NO 66755.0 11630 11630 43495 Volkswagen Jetta N 500 1000 5858 Thursday January 5085
3 36 474848 MALE JD armed-forces polo 47900 -73400 190 State2 722 1337.60 0 own-child Single Vehicle Collision Side Collision Minor Damage Other State9 City6 Location 2081 22.0 1 YES 2 3.0 NO 66243.0 12003 12003 42237 Volkswagen Jetta N 500 1000 4401 Monday January 3570
4 29 457942 FEMALE High School exec-managerial dancing 0 -41500 115 State2 500 1353.73 4279863 unmarried Single Vehicle Collision Rear Collision Minor Damage Fire State8 City6 Location 1695 10.0 1 NO 2 1.0 YES 53544.0 8829 7234 37481 Toyota CRV N 100 300 1834 Friday January 6650
In [46]:
# Change the dtype of 'DayOfWeek', 'MonthOfIncident' column from object to category
train_df['DayOfWeek'] = train_df['DayOfWeek'].astype('category')
train_df['MonthOfIncident'] = train_df['MonthOfIncident'].astype('category')
In [47]:
train_df.head()
Out[47]:
InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss CustomerLoyaltyPeriod InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleMake VehicleModel ReportedFraud SplitLimit CombinedSingleLimit VehicleAge DayOfWeek MonthOfIncident TimeBetweenCoverageAndIncident
0 35 454776 MALE JD armed-forces movies 56700 -48500 49 State1 1000 1632.73 0 not-in-family Multi-vehicle Collision Side Collision Total Loss Police State7 City1 Location 1311 17.0 3 NaN 1 0.0 NaN 65501.0 13417 6071 46013 Audi A5 N 100 300 2590 Tuesday February 5945
1 36 454776 MALE JD tech-support cross-fit 70600 -48500 114 State1 1000 1255.19 0 not-in-family Multi-vehicle Collision Side Collision Total Loss Police State7 City5 Location 1311 10.0 3 YES 2 1.0 YES 61382.0 15560 5919 39903 Audi A5 N 100 300 3319 Monday February 5192
2 33 603260 MALE JD armed-forces polo 66400 -63700 167 State3 617 1373.38 0 wife Single Vehicle Collision Side Collision Minor Damage Other State8 City6 Location 2081 22.0 1 YES 2 3.0 NO 66755.0 11630 11630 43495 Volkswagen Jetta N 500 1000 5858 Thursday January 5085
3 36 474848 MALE JD armed-forces polo 47900 -73400 190 State2 722 1337.60 0 own-child Single Vehicle Collision Side Collision Minor Damage Other State9 City6 Location 2081 22.0 1 YES 2 3.0 NO 66243.0 12003 12003 42237 Volkswagen Jetta N 500 1000 4401 Monday January 3570
4 29 457942 FEMALE High School exec-managerial dancing 0 -41500 115 State2 500 1353.73 4279863 unmarried Single Vehicle Collision Rear Collision Minor Damage Fire State8 City6 Location 1695 10.0 1 NO 2 1.0 YES 53544.0 8829 7234 37481 Toyota CRV N 100 300 1834 Friday January 6650
In [48]:
train_df.dtypes
Out[48]:
InsuredAge                           int64
InsuredZipCode                    category
InsuredGender                     category
InsuredEducationLevel             category
InsuredOccupation                 category
InsuredHobbies                    category
CapitalGains                         int64
CapitalLoss                          int64
CustomerLoyaltyPeriod                int64
InsurancePolicyState              category
Policy_Deductible                    int64
PolicyAnnualPremium                float64
UmbrellaLimit                        int64
InsuredRelationship               category
TypeOfIncident                    category
TypeOfCollission                  category
SeverityOfIncident                category
AuthoritiesContacted              category
IncidentState                     category
IncidentCity                      category
IncidentAddress                   category
IncidentTime                       float64
NumberOfVehicles                     int64
PropertyDamage                    category
BodilyInjuries                       int64
Witnesses                          float64
PoliceReport                      category
AmountOfTotalClaim                 float64
AmountOfInjuryClaim                  int64
AmountOfPropertyClaim                int64
AmountOfVehicleDamage                int64
VehicleMake                       category
VehicleModel                      category
ReportedFraud                     category
SplitLimit                           int64
CombinedSingleLimit                  int64
VehicleAge                           int64
DayOfWeek                         category
MonthOfIncident                   category
TimeBetweenCoverageAndIncident       int64
dtype: object

4.1.5.2 Test Data

In [49]:
# Derive Age of Vehicle on the day of Incident
test_df['VehicleAge'] = (test_df['DateOfIncident'] - test_df['VehicleYOM']).dt.days 

# Derive Day of the Week of Incident
test_df['DayOfWeek'] = test_df['DateOfIncident'].dt.day_name()

# Derive Month of Incident
test_df['MonthOfIncident'] = test_df['DateOfIncident'].dt.month_name()

# Derive Time between Policy Coverage and Incident
test_df['TimeBetweenCoverageAndIncident'] = (test_df['DateOfIncident'] - test_df['DateOfPolicyCoverage']).dt.days
In [50]:
test_df.drop(['DateOfIncident', 'VehicleYOM', 'DateOfPolicyCoverage'], axis=1, inplace=True)
In [51]:
# Change the dtype of 'DayOfWeek', 'MonthOfIncident' column from object to category
test_df['DayOfWeek'] = test_df['DayOfWeek'].astype('category')
test_df['MonthOfIncident'] = test_df['MonthOfIncident'].astype('category')
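Since the same object-to-category conversion recurs for the train and test frames, it can be wrapped in a small helper. This is a sketch; the `to_category` helper name is ours, not part of the notebook:

```python
import pandas as pd

def to_category(df, cols):
    """Cast the given object columns to the memory-efficient 'category' dtype."""
    for c in cols:
        df[c] = df[c].astype('category')
    return df

# Toy frame standing in for train_df / test_df
df = pd.DataFrame({'DayOfWeek': ['Monday', 'Tuesday'],
                   'MonthOfIncident': ['January', 'February']})
df = to_category(df, ['DayOfWeek', 'MonthOfIncident'])
print(df.dtypes)
```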
In [52]:
test_df.head()
Out[52]:
InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss CustomerLoyaltyPeriod InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleMake VehicleModel SplitLimit CombinedSingleLimit VehicleAge DayOfWeek MonthOfIncident TimeBetweenCoverageAndIncident
0 27 471704 FEMALE High School adm-clerical base-jumping 56400 -57000 84 State2 2000 1006.00 0 own-child Multi-vehicle Collision Front Collision Minor Damage Ambulance State5 City2 Location 1354 4.0 3 NO 0 0.0 NaN 68354.0 6835 8059 53460 Volkswagen Passat 500 1000 7340 Thursday February 6115
1 40 455810 FEMALE MD prof-specialty golf 56700 -65600 232 State3 500 1279.17 0 unmarried Single Vehicle Collision Rear Collision Minor Damage Fire State9 City5 Location 1383 16.0 1 NaN 1 1.0 NaN 55270.0 8113 5240 41917 Nissan Ultima 100 300 3299 Tuesday January 1160
2 39 461919 MALE JD other-service movies 30400 0 218 State2 1000 1454.67 1235986 other-relative Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 2030 20.0 3 NO 0 1.0 NaN 59515.0 7490 9110 42915 Suburu Impreza 250 500 1830 Monday January 1633
3 38 600904 FEMALE Masters exec-managerial video-games 68500 0 205 State3 2000 1287.76 5873212 wife Vehicle Theft NaN Trivial Damage None State7 City5 Location 1449 10.0 1 NaN 2 1.0 NaN 4941.0 494 866 3581 Accura TL 500 500 2193 Saturday January 5228
4 29 430632 FEMALE PhD sales board-games 35100 0 134 State3 2000 1413.14 5000000 own-child Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 1916 7.0 3 NO 2 3.0 NO 34650.0 7700 3850 23100 Dodge RAM 100 300 2974 Sunday February 5282

4.1.6 Correlation Heatmap

Plotting a heatmap to visualize correlation between variables

4.1.6.1 Train Data

In [53]:
import seaborn as sns
In [54]:
# Visualizing the correlation matrix by plotting heat map.
plt.style.use('seaborn-pastel')

upper_triangle = np.triu(train_df.corr())
plt.figure(figsize=(25,25))
sns.heatmap(train_df.corr(), vmin=-1, vmax=1, annot=True, square=True, fmt='0.2f', 
            annot_kws={'size':16}, cmap="plasma", mask=upper_triangle)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()
  • The heatmap visualizes the correlation matrix, so we can read off the relationship between any pair of numeric features.

  • It contains both positive and negative correlations.

  • Many of the columns are highly correlated with each other, which leads to a multicollinearity problem.
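To turn the visual impression of multicollinearity into a concrete list, the strongly correlated pairs can be extracted from the upper triangle of the correlation matrix. A sketch on toy data; the `high_corr_pairs` helper and the 0.8 threshold are our own choices:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().loc[lambda s: s > threshold].sort_values(ascending=False)

# Toy data: 'b' is a near-copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({'a': a,
                   'b': a + rng.normal(scale=0.05, size=200),
                   'c': rng.normal(size=200)})
pairs = high_corr_pairs(df)
print(pairs)  # only the ('a', 'b') pair survives the threshold
```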

4.1.6.2 Test Data

In [55]:
# Visualizing the correlation matrix by plotting heat map.
plt.style.use('seaborn-pastel')

upper_triangle = np.triu(test_df.corr())
plt.figure(figsize=(25,25))
sns.heatmap(test_df.corr(), vmin=-1, vmax=1, annot=True, square=True, fmt='0.2f', 
            annot_kws={'size':16}, cmap="plasma", mask=upper_triangle)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()

4.2 EDA & Visualizations

Missing Values

In [56]:
msno.matrix(train_df, figsize=(15, 9))  # pass figsize to msno directly; a separate plt.figure() call would only create an empty extra figure
plt.show()
In [57]:
msno.matrix(test_df, figsize=(15, 9))  # pass figsize to msno directly; a separate plt.figure() call would only create an empty extra figure
plt.show()

Findings:

  • There are large chunks of missing records in our data, as the matrix above shows. Imputing them with the mean, median, or mode would bias the data towards those values, so it is better to use more advanced imputation techniques.
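One such technique is k-nearest-neighbours imputation, which fills each gap from the most similar rows instead of a single global statistic. A minimal sketch on a toy numeric frame (the real notebook columns such as AmountOfTotalClaim would be handled the same way):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with gaps in both numeric columns
df = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0, 5.0],
                   'y': [10.0, 20.0, 30.0, np.nan, 50.0]})

# Each missing value is filled from the 2 most similar rows,
# which avoids the bias of a single global mean/median/mode
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled.isna().sum().sum())  # 0 -- no gaps remain
```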

Finding Potential Outliers

In [58]:
columns= ['InsuredAge',
 'CapitalGains',
 'CapitalLoss',
 'CustomerLoyaltyPeriod',
 'Policy_Deductible',
 'PolicyAnnualPremium',
 'UmbrellaLimit',
 'IncidentTime',
 'NumberOfVehicles',
 'BodilyInjuries',
 'Witnesses',
 'AmountOfTotalClaim',
 'AmountOfInjuryClaim',
 'AmountOfPropertyClaim',
 'AmountOfVehicleDamage',
 'SplitLimit',
 'CombinedSingleLimit',
 'VehicleAge']
def plot_boxplot_side_by_side(df1, df2, columns):
    # Create two subplots
    fig, axs = plt.subplots(nrows=len(columns), ncols=2, figsize=(15, 5*len(columns)))
    
    for i, col in enumerate(columns):
        # Draw the first plot on the left
        sns.boxplot(x=df1[col], ax=axs[i, 0],color="teal")
        axs[i, 0].set_xlabel(col + " (Train)")
        
        # Draw the second plot on the right
        sns.boxplot(x=df2[col], ax=axs[i, 1],color="indigo")
        axs[i, 1].set_xlabel(col + " (Test)")
        
    # Show the plot
    plt.show()
    
plot_boxplot_side_by_side(train_df, test_df, columns)

Findings:

  • The following columns could contain potential outliers:
    • InsuredAge
    • CustomerLoyaltyPeriod
    • PolicyAnnualPremium
    • UmbrellaLimit
    • AmountOfTotalClaim
    • AmountOfInjuryClaim
    • AmountOfPropertyClaim
    • AmountOfVehicleDamage
  • We leave them untouched for now, since we cannot yet say for sure that they affect our model.
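If we do decide to act on them later, a standard way to flag potential outliers is Tukey's IQR rule, the same rule the box-plot whiskers are based on. A sketch (the `iqr_outlier_mask` helper is ours):

```python
import pandas as pd

def iqr_outlier_mask(s, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

s = pd.Series([10, 11, 12, 13, 12, 11, 500])  # 500 is an obvious outlier
mask = iqr_outlier_mask(s)
print(mask.sum())  # 1 -- only the value 500 is flagged
```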

Checking the spread of Numeric Data

In [59]:
# columns= ['InsuredAge',
#  'CapitalGains',
#  'CapitalLoss',
#  'CustomerLoyaltyPeriod',
#  'Policy_Deductible',
#  'PolicyAnnualPremium',
#  'UmbrellaLimit',
#  'IncidentTime',
#  'NumberOfVehicles',
#  'BodilyInjuries',
#  'Witnesses',
#  'AmountOfTotalClaim',
#  'AmountOfInjuryClaim',
#  'AmountOfPropertyClaim',
#  'AmountOfVehicleDamage',
#  'SplitLimit',
#  'CombinedSingleLimit',
#  'VehicleAge']
# def plot_kdeplot_side_by_side(df1, df2, columns):
#     # Create two subplots
#     fig, axs = plt.subplots(nrows=len(columns), ncols=2, figsize=(15, 5*len(columns)))
    
#     for i, col in enumerate(columns):
#         # Draw the first plot on the left
#         sns.kdeplot(df1[col], ax=axs[i, 0], fill=True, color='teal')
#         axs[i, 0].set_xlabel(col + " (Train)")
        
#         # Draw the second plot on the right
#         sns.kdeplot(df2[col], ax=axs[i, 1], fill=True, color='indigo')
#         axs[i, 1].set_xlabel(col + " (Test)")
        
#     # Show the plot
#     plt.show()
    
# plot_kdeplot_side_by_side(train_df, test_df, columns)
In [60]:
# Checking the skewness in our DataFrame
train_df.skew().sort_values()
Out[60]:
AmountOfVehicleDamage            -0.819908
AmountOfTotalClaim               -0.782618
CapitalLoss                      -0.503664
TimeBetweenCoverageAndIncident   -0.047200
IncidentTime                     -0.045473
Witnesses                         0.007207
PolicyAnnualPremium               0.012344
BodilyInjuries                    0.024941
VehicleAge                        0.046000
AmountOfInjuryClaim               0.094468
AmountOfPropertyClaim             0.175570
SplitLimit                        0.393830
CustomerLoyaltyPeriod             0.394522
InsuredAge                        0.506413
NumberOfVehicles                  0.509554
Policy_Deductible                 0.567654
CombinedSingleLimit               0.582745
CapitalGains                      0.620799
UmbrellaLimit                     1.925123
dtype: float64

Findings:

  • There is skewness in the data, so we will standardize it using StandardScaler from sklearn. Note that standardization rescales each feature to zero mean and unit variance but does not remove skew; a log or power transform would be needed for that.
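A minimal sketch of what StandardScaler does, on a toy right-skewed feature: it centres the feature to mean 0 and variance 1, but the skewed shape survives the scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # right-skewed toy feature
scaled = StandardScaler().fit_transform(X)

# Mean 0, unit variance -- but the 100.0 outlier still dominates the shape.
# A log transform such as np.log1p is what actually reduces the skew.
print(round(scaled.mean(), 10), round(scaled.std(), 10))
```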

Distribution of individual classes in our DataFrame

In [61]:
def generate_pie(df):
    plt.figure(figsize=(10,5))
    plt.pie(df.value_counts(), labels=df.value_counts().index, autopct='%1.2f%%',shadow=True, explode=(0,.2), colors=['#006400' ,'#B22222'])
    plt.legend(prop={'size':9})
    plt.axis('equal')
    return plt.show()

generate_pie(train_df.ReportedFraud)

From the plot we can observe that the count of "N" is far higher than that of "Y": most insurance claims are not reported as fraudulent, which is usually the case.

Since this is our target column, it indicates a class-imbalance issue. We will balance the data using an oversampling method in a later part.
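The oversampling step referred to here can be done, for example, by resampling the minority class with replacement until it matches the majority class. A sketch on a toy frame using `sklearn.utils.resample` (the later part of the notebook may use a different oversampler):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 8 'N' rows vs 2 'Y' rows
df = pd.DataFrame({'feature': range(10),
                   'ReportedFraud': ['N'] * 8 + ['Y'] * 2})

majority = df[df['ReportedFraud'] == 'N']
minority = df[df['ReportedFraud'] == 'Y']

# Sample the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced['ReportedFraud'].value_counts())  # N: 8, Y: 8
```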

In [62]:
# getting categoric columns in ca_cols variable
cat_cols = ['InsuredGender', 'InsuredEducationLevel',
            'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident',
            'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted',
            'PropertyDamage', 'PoliceReport', 
            'MonthOfIncident']

def generate_pie(df):
    plt.figure(figsize=(8,3))
    plt.pie(df.value_counts(), labels=df.value_counts().index, autopct='%1.2f%%',shadow=True)
    plt.legend(prop={'size':9})
    plt.axis('equal')
    return plt.show()
    

for col in train_df[cat_cols]:
    print(f"Pie plot for the column:", col)
    generate_pie(train_df[col])
Pie plot for the column: InsuredGender
Pie plot for the column: InsuredEducationLevel
Pie plot for the column: InsurancePolicyState
Pie plot for the column: InsuredRelationship
Pie plot for the column: TypeOfIncident
Pie plot for the column: TypeOfCollission
Pie plot for the column: SeverityOfIncident
Pie plot for the column: AuthoritiesContacted
Pie plot for the column: PropertyDamage
Pie plot for the column: PoliceReport
Pie plot for the column: MonthOfIncident

Findings:

  • InsuredGender: There are more insured females than males.
  • InsuredEducationLevel: People with 'JD' degrees are the most numerous among the insured, while those with college degrees are the fewest.
  • InsurancePolicyState: "State3" has the highest number of insured people and "State2" the least.
  • InsuredRelationship: 'own-child' was the most common nominee, followed by "not-in-family", with "unmarried" the least common.
  • TypeOfIncident: "Multi-vehicle Collision" is the most frequent incident type, whereas "Parked Car" is the least.
  • TypeOfCollission: "Rear Collision" is the most common collision type, with "Front Collision" the least, but only by 6%.
  • SeverityOfIncident: The majority of vehicle damage is "Minor Damage", while "Trivial Damage" is the least frequent.
  • AuthoritiesContacted: "Police" was the authority most often contacted after an incident, although in some cases no authority was contacted at all.
  • PropertyDamage: Most of the time there was no property damage, although 47.29% of incidents did cause some property damage.
  • PoliceReport: 52% of the time a police report was not available.
  • MonthOfIncident: 51.31% of the incidents happened in January, 47.87% in February, and only 0.82% in March.
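The percentages read off the pies can also be computed directly with `value_counts(normalize=True)`; a sketch on a toy series:

```python
import pandas as pd

# Toy month column roughly mirroring the distribution above
months = pd.Series(['January'] * 63 + ['February'] * 36 + ['March'] * 1)
pct = (months.value_counts(normalize=True) * 100).round(2)
print(pct)
```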
In [63]:
# getting categoric columns in ca_cols variable
cat_cols = ['InsuredOccupation', 'InsuredHobbies', 'IncidentState', 'IncidentCity', 'VehicleMake', 'DayOfWeek', 'MonthOfIncident']

def plot_countplots(df, columns):
    # Calculate the number of rows and columns needed for the plot
    n_cols = 2
    n_rows = (len(columns) + 1) // 2
    
    # Create the subplot grid
    fig, axs = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(12, 4*n_rows))
    plt.xticks(rotation=90)
    
    # Flatten the subplot grid
    axs = axs.flatten()
    
    # Loop over the columns and plot a countplot for each one
    for i, col in enumerate(columns):
        sns.countplot(x=col, data=df, ax=axs[i],palette="gnuplot", )
        axs[i].set_xticklabels(axs[i].get_xticklabels(), rotation=90)
        
    # Remove any unused subplots
    if len(columns) < n_rows * n_cols:
        for i in range(len(columns), n_rows * n_cols):
            fig.delaxes(axs[i])
    
    # Display the plot
    plt.tight_layout()
    plt.show()
    
plot_countplots(train_df, cat_cols)

Finding:

  • In InsuredOccupation, most of the records come from machine inspectors, followed by professional specialty; the other occupations have roughly equal counts.
  • For InsuredHobbies, bungee-jumping has the highest count, followed by camping and paintball; the other categories have average counts.
  • For IncidentState, State5, State7, and State9 have the highest counts.
  • In IncidentCity, almost all cities have equal counts.
  • Among the vehicle manufacturers, Saab, Suburu, Dodge, and Nissan have the highest counts.
  • For DayOfWeek, incidents are spread almost evenly across the days.
In [64]:
# Comparision between two variables
plt.figure(figsize=[18,13])

plt.subplot(2,2,1)
plt.title('Comparision CustomerLoyaltyPeriod and InsuredAge')
sns.scatterplot(x=train_df['CustomerLoyaltyPeriod'],y=train_df['InsuredAge'],hue=train_df['ReportedFraud'],palette="gnuplot");

plt.subplot(2,2,2)
plt.title('Comparision between AmountOfTotalClaim and AmountOfInjuryClaim')
sns.scatterplot(x=train_df['AmountOfTotalClaim'],y=train_df['AmountOfInjuryClaim'],hue=train_df['ReportedFraud'],palette="gnuplot");

plt.subplot(2,2,3)
plt.title('Comparision between AmountOfPropertyClaim and AmountOfVehicleDamage')
sns.scatterplot(x=train_df['AmountOfPropertyClaim'],y=train_df['AmountOfVehicleDamage'],hue=train_df['ReportedFraud'],palette="gnuplot");

plt.subplot(2,2,4)
plt.title('Comparision between CustomerLoyaltyPeriod and AmountOfTotalClaim')
sns.scatterplot(x=train_df['CustomerLoyaltyPeriod'],y=train_df['AmountOfTotalClaim'],hue=train_df['ReportedFraud'],palette="gnuplot");

From the above scatter plots we can observe the following:

  • There is a positive linear relationship between InsuredAge and CustomerLoyaltyPeriod: as InsuredAge increases, CustomerLoyaltyPeriod also increases, and very little fraud is reported in this case.
  • The second plot also shows a positive linear relationship: as the total claim amount increases, the injury claim increases too.
  • The third plot behaves like the second: as the property claim increases, the vehicle claim also increases.
  • In the fourth plot the data is scattered and there is no clear relationship between the features.
In [65]:
fig,axes=plt.subplots(2,2,figsize=(12,10))

# Comparing insured_sex and age
sns.violinplot(x='InsuredGender',y='InsuredAge',ax=axes[0,0],data=train_df,palette="ch:.25",hue="ReportedFraud",split=True)

# Comparing policy_state and witnesses
sns.violinplot(x='InsurancePolicyState',y='Witnesses',ax=axes[0,1],data=train_df,hue="ReportedFraud",split=True,palette="hls")

# Comparing csl_per_accident and property_claim
sns.violinplot(x='CombinedSingleLimit',y='AmountOfPropertyClaim',ax=axes[1,0],data=train_df,hue="ReportedFraud",split=True,palette="Dark2")

# Comparing csl_per_person and age
sns.violinplot(x='SplitLimit',y='InsuredAge',ax=axes[1,1],data=train_df,hue="ReportedFraud",split=True,palette="mako")
plt.show()

Findings:

  • Fraud reports are high for both males and females aged 30-45. Policyholders in the "State3" policy state show a high fraud rate. Customers with combined-single-limit cover who claim property amounts in the range 5000-15000 account for most of the fraud reports, and fraud under SplitLimit also concentrates in the 30-45 age group.
In [66]:
fig,axes=plt.subplots(2,2,figsize=(12,12))

# Comparing insured_sex and age
sns.violinplot(x='ReportedFraud',y='AmountOfTotalClaim',ax=axes[0,0],data=train_df,hue="ReportedFraud" ,palette="hls")

# Comparing policy_state and witnesses
sns.violinplot(x='ReportedFraud',y='AmountOfVehicleDamage',ax=axes[0,1],data=train_df,hue="ReportedFraud",palette="cool_r")

# Comparing csl_per_accident and property_claim
sns.violinplot(x='ReportedFraud',y='AmountOfPropertyClaim',ax=axes[1,0],data=train_df,hue="ReportedFraud",palette="cividis")

# Comparing csl_per_person and age
sns.violinplot(x='ReportedFraud',y='AmountOfInjuryClaim',ax=axes[1,1],data=train_df,hue="ReportedFraud",palette="gnuplot2")
plt.show()

Findings:

  • Most fraud reports occur when the total claimed amount is between 50000 and 70000. Fraud is high when the vehicle-damage claim is between 37000 and 57000, when the property claim is between 5200 and 8500, and when the injury claim is between 5000 and 8000.
In [67]:
# Comparing policy_state and fraud_reported
sns.catplot(x='InsurancePolicyState',kind='count',data=train_df,hue='ReportedFraud',palette="Dark2")
plt.show()

Findings:

  • Fraud reports are somewhat higher for the "State3" policy state.
In [68]:
# Comparing insured_education_level and fraud_reported
sns.catplot(x='InsuredEducationLevel',kind='count',data=train_df,hue='ReportedFraud',palette="tab20b_r")
plt.xticks(rotation=90)
plt.show()

Findings:

  • Fraud is least common among people with only a high-school education and highest among those with a "JD" degree. In general, people with higher education levels face more insurance fraud reports than those with lower education levels.
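Per-category fraud rates, rather than raw counts, make such comparisons precise; `pd.crosstab` with `normalize='index'` gives the fraud share within each education level. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'InsuredEducationLevel': ['JD', 'JD', 'JD', 'High School', 'High School'],
    'ReportedFraud':         ['Y',  'N',  'Y',  'N',           'N'],
})
# Each row of the result sums to 1: the 'Y' column is the fraud rate per level
rates = pd.crosstab(df['InsuredEducationLevel'], df['ReportedFraud'],
                    normalize='index')
print(rates)
```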
In [69]:
# Comparing insured_occupation and fraud_reported
sns.catplot(x='InsuredOccupation',kind='count',data=train_df,hue='ReportedFraud',palette="spring_r")
plt.xticks(rotation=90)
plt.show()

Findings:

  • People in exec-managerial positions have higher fraud reports than the other occupations.
In [70]:
# Comparing insured_hobbies and fraud_reported
sns.catplot(x='InsuredHobbies',kind='count',data=train_df,hue='ReportedFraud',palette="gnuplot2")
plt.xticks(rotation=90)
plt.show()

Findings:

  • Fraud reports are high for people whose hobbies are chess and cross-fit.
In [71]:
# Comparing insured_relationship and fraud_reported
sns.catplot(x='InsuredRelationship',kind='count',data=train_df,hue='ReportedFraud',palette="gist_earth")
plt.xticks(rotation=90)
plt.show()

Findings:

  • Fraud reports are high for customers whose relationship is other-relative, and very low for husband and own-child.
In [72]:
# Comparing incident_type and fraud_reported
sns.catplot(x='TypeOfIncident',kind='count',data=train_df,hue='ReportedFraud',palette="Set2_r")
plt.xticks(rotation=90)
plt.show()

Findings:

  • In both multi-vehicle and single-vehicle collisions, fraud reports are very high compared to the other incident types.
In [73]:
# Comparing collision_type and fraud_reported
sns.catplot(x='TypeOfCollission',kind='count',data=train_df,hue='ReportedFraud',palette="gist_earth")
plt.xticks(rotation=90)
plt.show()

Findings:

  • Fraud is highest for the Rear Collision type; the other two collision types have average report counts.
In [74]:
# Comparing incident_severity and fraud_reported
sns.catplot(x='SeverityOfIncident',kind='count',data=train_df,hue='ReportedFraud',palette="mako")
plt.xticks(rotation=90)
plt.show()

Findings:

  • Fraud reports are highest for Major Damage incident severity and lowest for Trivial Damage.
In [75]:
# Comparing authorities_contacted and fraud_reported
sns.catplot(x='AuthoritiesContacted',kind='count',data=train_df,hue='ReportedFraud',palette="magma")
plt.xticks(rotation=90)
plt.show()

Findings:

  • Police were contacted in by far the most cases; the fraud rate is roughly equal across all authorities except None.
In [76]:
# Comparing incident_state and fraud_reported
sns.catplot(x='IncidentState',kind='count',data=train_df,col='ReportedFraud',palette="cubehelix")
plt.xticks(rotation=90)
plt.show()

Findings:

  • "State7" has high fraud reports compared to the other incident states.
In [77]:
# Comparing incident_city and fraud_reported
sns.catplot(x='IncidentCity',kind='count',data=train_df,hue='ReportedFraud',palette="bright")
plt.xticks(rotation=90)
plt.show()

Findings:

  • "City1" and "City2" have far higher fraud reports than the other cities.
In [78]:
# Comparing property_damage and fraud_reported
sns.catplot(x='PropertyDamage',kind='count',data=train_df,col='ReportedFraud',palette="ocean")
plt.show()

Findings:

  • Customers with a property-damage case have comparatively high fraud reports.
In [79]:
# Comparing police_report_available and fraud_reported
sns.catplot(x='PoliceReport',kind='count',data=train_df,col='ReportedFraud',palette="coolwarm")
plt.xticks(rotation=90)
plt.show()

Findings:

  • When no police report is available, the fraud rate is very high.
In [80]:
# Comparing auto_make and fraud_reported
sns.catplot(x='VehicleMake',kind='count',data=train_df,hue='ReportedFraud',palette="bright")
plt.xticks(rotation=90)
plt.show()

Findings:

  • Fraud reports are roughly the same across all vehicle makes, except "Ford", which is slightly higher.
In [81]:
sns.pairplot(train_df,hue="ReportedFraud",palette="ocean")
plt.show()

The pairplot shows the pairwise relationship between each numeric feature, colored by ReportedFraud.

In [82]:
pip install ydata-profiling
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: ydata-profiling in /home/5014b121/.local/lib/python3.7/site-packages (4.1.1)
Requirement already satisfied: visions[type_image_path]==0.7.5 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (0.7.5)
Requirement already satisfied: numpy<1.24,>=1.16.0 in /usr/share/anaconda3/lib/python3.7/site-packages (from ydata-profiling) (1.19.5)
Requirement already satisfied: jinja2<3.2,>=2.11.1 in /usr/share/anaconda3/lib/python3.7/site-packages (from ydata-profiling) (2.11.1)
Requirement already satisfied: typeguard<2.14,>=2.13.2 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (2.13.3)
Requirement already satisfied: matplotlib<3.7,>=3.2 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (3.5.3)
Requirement already satisfied: pydantic<1.11,>=1.8.1 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (1.10.6)
Requirement already satisfied: requests<2.29,>=2.24.0 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (2.28.2)
Requirement already satisfied: seaborn<0.13,>=0.10.1 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (0.12.2)
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in /usr/share/anaconda3/lib/python3.7/site-packages (from ydata-profiling) (5.3)
Requirement already satisfied: phik<0.13,>=0.11.1 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (0.12.3)
Requirement already satisfied: tqdm<4.65,>=4.48.2 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (4.64.1)
Requirement already satisfied: pandas!=1.4.0,<1.6,>1.1 in /usr/share/anaconda3/lib/python3.7/site-packages (from ydata-profiling) (1.2.4)
Requirement already satisfied: scipy<1.10,>=1.4.1 in /usr/share/anaconda3/lib/python3.7/site-packages (from ydata-profiling) (1.6.3)
Requirement already satisfied: statsmodels<0.14,>=0.13.2 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (0.13.5)
Requirement already satisfied: multimethod<1.10,>=1.4 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (1.9.1)
Requirement already satisfied: htmlmin==0.1.12 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (0.1.12)
Requirement already satisfied: imagehash==4.3.1 in /home/5014b121/.local/lib/python3.7/site-packages (from ydata-profiling) (4.3.1)
Requirement already satisfied: tangled-up-in-unicode>=0.0.4 in /home/5014b121/.local/lib/python3.7/site-packages (from visions[type_image_path]==0.7.5->ydata-profiling) (0.2.0)
Requirement already satisfied: networkx>=2.4 in /usr/share/anaconda3/lib/python3.7/site-packages (from visions[type_image_path]==0.7.5->ydata-profiling) (2.4)
Requirement already satisfied: attrs>=19.3.0 in /usr/share/anaconda3/lib/python3.7/site-packages (from visions[type_image_path]==0.7.5->ydata-profiling) (19.3.0)
Requirement already satisfied: Pillow; extra == "type_image_path" in /usr/share/anaconda3/lib/python3.7/site-packages (from visions[type_image_path]==0.7.5->ydata-profiling) (7.0.0)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/share/anaconda3/lib/python3.7/site-packages (from jinja2<3.2,>=2.11.1->ydata-profiling) (1.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/share/anaconda3/lib/python3.7/site-packages (from matplotlib<3.7,>=3.2->ydata-profiling) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/share/anaconda3/lib/python3.7/site-packages (from matplotlib<3.7,>=3.2->ydata-profiling) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /usr/share/anaconda3/lib/python3.7/site-packages (from matplotlib<3.7,>=3.2->ydata-profiling) (0.10.0)
Requirement already satisfied: packaging>=20.0 in /usr/share/anaconda3/lib/python3.7/site-packages (from matplotlib<3.7,>=3.2->ydata-profiling) (20.1)
Requirement already satisfied: fonttools>=4.22.0 in /home/5014b121/.local/lib/python3.7/site-packages (from matplotlib<3.7,>=3.2->ydata-profiling) (4.38.0)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/share/anaconda3/lib/python3.7/site-packages (from matplotlib<3.7,>=3.2->ydata-profiling) (2.4.6)
Requirement already satisfied: typing-extensions>=4.2.0 in /home/5014b121/.local/lib/python3.7/site-packages (from pydantic<1.11,>=1.8.1->ydata-profiling) (4.5.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/share/anaconda3/lib/python3.7/site-packages (from requests<2.29,>=2.24.0->ydata-profiling) (1.25.8)
Requirement already satisfied: idna<4,>=2.5 in /usr/share/anaconda3/lib/python3.7/site-packages (from requests<2.29,>=2.24.0->ydata-profiling) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in /usr/share/anaconda3/lib/python3.7/site-packages (from requests<2.29,>=2.24.0->ydata-profiling) (2019.11.28)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/5014b121/.local/lib/python3.7/site-packages (from requests<2.29,>=2.24.0->ydata-profiling) (3.1.0)
Requirement already satisfied: joblib>=0.14.1 in /usr/share/anaconda3/lib/python3.7/site-packages (from phik<0.13,>=0.11.1->ydata-profiling) (0.14.1)
Requirement already satisfied: pytz>=2017.3 in /usr/share/anaconda3/lib/python3.7/site-packages (from pandas!=1.4.0,<1.6,>1.1->ydata-profiling) (2019.3)
Requirement already satisfied: patsy>=0.5.2 in /home/5014b121/.local/lib/python3.7/site-packages (from statsmodels<0.14,>=0.13.2->ydata-profiling) (0.5.3)
Requirement already satisfied: PyWavelets in /usr/share/anaconda3/lib/python3.7/site-packages (from imagehash==4.3.1->ydata-profiling) (1.1.1)
Requirement already satisfied: decorator>=4.3.0 in /usr/share/anaconda3/lib/python3.7/site-packages (from networkx>=2.4->visions[type_image_path]==0.7.5->ydata-profiling) (4.4.1)
Requirement already satisfied: six>=1.5 in /usr/share/anaconda3/lib/python3.7/site-packages (from python-dateutil>=2.7->matplotlib<3.7,>=3.2->ydata-profiling) (1.15.0)
Requirement already satisfied: setuptools in /usr/share/anaconda3/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib<3.7,>=3.2->ydata-profiling) (45.2.0.post20200210)
Note: you may need to restart the kernel to use updated packages.
In [83]:
from pandas_profiling import ProfileReport
/usr/share/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: DeprecationWarning: `import pandas_profiling` is going to be deprecated by April 1st. Please use `import ydata_profiling` instead.
  """Entry point for launching an IPython kernel.
In [84]:
report = ProfileReport(train_df)
report
Out[84]:

In [85]:
# report.to_file("Visualization-report.html")

4.1.7 Performing Train-Test Split

Splitting the train dataframe into train and test sets so as to evaluate the performance of the model.

4.1.7.1 Train Data

In [86]:
def get_X_y_dataframes(DataFrame, target_variable):
    X = DataFrame.drop(columns=target_variable,axis=1) # Assigning the rest of the column to 'X'
    y = DataFrame[target_variable] # Assigning the 'target_variable' column to 'y'
    print("Columns in X :",X.columns) # Printing columns in X
    print("shape of X :",X.shape)
    print("*"*40)
    print("shape of y :",y.shape)
    #print("Columns in y :",y.columns) # Printing columns in y
    return X,y

X, y = get_X_y_dataframes(train_df, 'ReportedFraud')
Columns in X : Index(['InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleMake', 'VehicleModel', 'SplitLimit', 'CombinedSingleLimit', 'VehicleAge', 'DayOfWeek', 'MonthOfIncident', 'TimeBetweenCoverageAndIncident'], dtype='object')
shape of X : (28836, 39)
****************************************
shape of y : (28836,)
In [87]:
from sklearn.model_selection import train_test_split # Scikit learn offers this library to perform Train and Test Split

# Function to perform train and test split
def perform_only_train_test_split(X, y, test_size=0.3, random_state=1234, stratify=None):
    # X has all the attributes except the target variable
    # y has the target attribute
    # test_size controls the split, e.g. 70-30 or 80-20
    # stratify splits the dataset into train and test sets in a way that preserves the same class proportions as observed in the original dataset

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=stratify)
    
    print('X_train shape: ',X_train.shape) # Printing X_train shape
    print('X_test shape: ',X_test.shape) # Printing X_test shape
    print('y_train shape: ',y_train.shape) # Printing y_train shape
    print('y_test shape: ',y_test.shape) # Printing y_test shape
    
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = perform_only_train_test_split(X,y,test_size=0.3,random_state=1234, stratify = y)
X_train shape:  (20185, 39)
X_test shape:  (8651, 39)
y_train shape:  (20185,)
y_test shape:  (8651,)
In [88]:
#To check the distribution in the target in train and test
print("Train Distribution")
print("↓"*20)
print("Train Unique values counts")
print(y_train.value_counts())
print("Train Unique values percentages")
print(y_train.value_counts()/y_train.value_counts().sum()*100)
print("Train Total Unique values counts")
print(y_train.value_counts().sum())
print("*"*40)
print("Test Distribution")
print("↓"*20)
print("Test Unique value counts")
print(y_test.value_counts())
print("Test Unique values percentages")
print(y_test.value_counts()/y_test.value_counts().sum()*100)
print("Test Total Unique values counts")
print(y_test.value_counts().sum())
Train Distribution
↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
Train Unique values counts
N    14736
Y     5449
Name: ReportedFraud, dtype: int64
Train Unique values percentages
N    73.004706
Y    26.995294
Name: ReportedFraud, dtype: float64
Train Total Unique values counts
20185
****************************************
Test Distribution
↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
Test Unique value counts
N    6315
Y    2336
Name: ReportedFraud, dtype: int64
Test Unique values percentages
N    72.997341
Y    27.002659
Name: ReportedFraud, dtype: float64
Test Total Unique values counts
8651

4.1.8 Handling Missing Values

As we know, if our data has missing values, most models will not train; only a few, such as some tree-based models, can tolerate them. In this section we look at how missing data can be handled.

We can handle missing data using the following techniques:

  • Deletion of Data
  • Encoding Missingness
  • Imputation Methods

1. Deletion of Data:

  • The simplest approach for dealing with missing values is to remove the entire records that contain them. By doing this we lose a lot of potentially useful data for model building. If the number of records is low, we may end up with too little data to train a model.
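As a minimal sketch of record deletion (on a toy frame with hypothetical column names, not the notebook's actual data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [36, 28, np.nan], "claim": [73318.0, np.nan, 37502.0]})
complete = df.dropna()  # keep only rows with no missing values
print(len(df), "->", len(complete))  # 3 -> 1
```

Two of the three rows are discarded here even though each is only missing a single value, which illustrates how quickly deletion can shrink a dataset.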

2. Encoding Missingness:

  • When an attribute is discrete in nature, missingness can be directly encoded into the attribute as if it were a naturally occurring category.
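A minimal sketch of encoding missingness as its own category, using `fillna` on a toy series (the "Missing" label is an arbitrary choice):

```python
import numpy as np
import pandas as pd

s = pd.Series(["YES", np.nan, "NO", np.nan])
encoded = s.fillna("Missing")  # treat missingness as a naturally occurring category
print(encoded.value_counts().to_dict())  # {'Missing': 2, 'YES': 1, 'NO': 1}
```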

3. Imputation Methods:

  • Another approach is to impute, or estimate, the missing values. Imputation uses information and relationships among the non-missing attributes to provide an estimate that fills in the missing value. Many imputation techniques are used in data science.

Some of the prominent ones are:

  • Imputing missing values with Mean/Median/Mode, depending on the data type of the column.
    • In this method we use the Mean/Median/Mode of that particular column to impute the missing value. However, this imputation considers only the values of that single column, and if the volume of missing values is high the model may become biased towards the imputed value.
  • Linear model for imputation.
    • In this method we build a linear model to predict the missing value. However, this assumes that the record has no other missing values and that the relationship is linear, which is not ideal in every scenario.
  • KNN for Imputation.
    • When the training set is small or moderate in size, K-nearest neighbours can be a quick and effective method for imputing missing values. The procedure identifies a sample with one or more missing values, then finds the K most similar complete samples in the training data (i.e., samples with no missing values in the relevant columns). Similarity is defined by a distance metric; when all of the predictors are numeric, standard Euclidean distance is commonly used. After computing the distances, the K closest samples are identified and the average value of the predictor of interest is calculated; this value replaces the missing value. This method works well and overcomes the issues stated above, but it requires the neighbouring records to be numeric and complete. It does an acceptable job on categorical variables, but there is another method that works well on both categorical and numerical data.
  • Tree Based Imputation.
    • This approach is not only able to deal with mixed-type variables, but is also more reliable in imputation, both in the case of data missing at random (MAR) and missing not at random (MNAR), which is often the more common case in real-world data. In this method we can choose any tree-based model to predict the missing values.
    • Note: this imputation technique is very computationally heavy. When run on a cluster, the observed usage was:
      • 45 GB of RAM for categorical imputation
      • 7 GB of RAM for numerical imputation
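The tree-based idea above can be sketched with scikit-learn's `IterativeImputer` wrapping a tree ensemble; this is one possible realisation on toy numeric data, not the configuration used in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the API)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
df.loc[::7, "b"] = np.nan  # introduce some missing values

# Each feature with missing values is modelled as a function of the other
# features, here with a tree ensemble as the per-feature estimator.
imputer = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0))
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed.isna().sum().sum())  # 0
```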

4.1.8.1 Train Data

Exploring Missing Data

Visualising Missing Values in the DataFrame

  • The nullity matrix allows us to see the distribution of data across all columns in the whole dataset. It also shows a sparkline (or, in some cases, a striped line) that emphasizes the rows in the dataset with the highest and lowest nullity.
In [89]:
msno.matrix(X_train, figsize=(15, 9)) # Pass figsize to msno directly; a separate plt.figure() call would only create an empty figure
plt.show()

Correlation Heatmap

  • Correlation heatmap measures nullity correlation between columns of the dataset. It shows how strongly the presence or absence of one feature affects the other.

Nullity correlation ranges from -1 to 1:

  • -1 means that if one column (attribute) is present, the other is almost certainly absent.
  • 0 means there is no dependence between the columns (attributes).
  • 1 means that if one column (attribute) is present, the other is almost certainly present as well.

    • Unlike a familiar correlation heatmap, many columns are absent here: columns that are always full or always empty have no meaningful nullity correlation and are removed from the visualization.

    • The heatmap is helpful for identifying data-completeness correlations between attribute pairs, but it has limited explanatory power for broader relationships and no special support for very large datasets.

In [90]:
msno.heatmap(X_train,cmap="PiYG", figsize=(20,10), fontsize=13);

Dendrogram

  • The dendrogram shows the hierarchical nullity relationship between columns, using a hierarchical clustering algorithm to bin columns against one another by their nullity correlation.
In [91]:
msno.dendrogram(train_df, figsize=(20,15), fontsize=15, label_rotation=90);

Simple Numeric Summaries

  • Moving forward, let us analyse a numerical summary of the missing attributes. Simple numerical summaries are effective at identifying problematic predictors and samples when the data becomes too large to inspect visually.
In [92]:
# This function takes a DataFrame(df) as input and returns two columns, total missing values and total missing values percentage
def missing_percentage(df):
    total = df.isnull().sum().sort_values(ascending = False) # Counting the null values in each column and assigning the result to 'total'
    percent = round(df.isnull().sum().sort_values(ascending = False)/len(df)*100,2) # Converting the null value count into percentage and assigning it to percent
    return pd.concat([total, percent], axis=1, keys=['Total','Percentage']) # Returning a table with columns 'Total' & 'Percentage'

missing_percentage(X_train)
Out[92]:
Total Percentage
PropertyDamage 7296 36.15
PoliceReport 6853 33.95
TypeOfCollission 3664 18.15
PolicyAnnualPremium 99 0.49
AmountOfTotalClaim 36 0.18
VehicleMake 36 0.18
Witnesses 34 0.17
InsuredGender 23 0.11
IncidentTime 22 0.11
InsuredAge 0 0.00
AmountOfInjuryClaim 0 0.00
BodilyInjuries 0 0.00
AmountOfVehicleDamage 0 0.00
AmountOfPropertyClaim 0 0.00
VehicleModel 0 0.00
SplitLimit 0 0.00
CombinedSingleLimit 0 0.00
VehicleAge 0 0.00
DayOfWeek 0 0.00
MonthOfIncident 0 0.00
NumberOfVehicles 0 0.00
IncidentCity 0 0.00
IncidentAddress 0 0.00
InsurancePolicyState 0 0.00
InsuredEducationLevel 0 0.00
InsuredOccupation 0 0.00
InsuredHobbies 0 0.00
CapitalGains 0 0.00
CapitalLoss 0 0.00
CustomerLoyaltyPeriod 0 0.00
Policy_Deductible 0 0.00
InsuredZipCode 0 0.00
UmbrellaLimit 0 0.00
InsuredRelationship 0 0.00
TypeOfIncident 0 0.00
SeverityOfIncident 0 0.00
AuthoritiesContacted 0 0.00
IncidentState 0 0.00
TimeBetweenCoverageAndIncident 0 0.00
In [93]:
# function to create mapping dictionaries
def create_map_dict(df, column):
    unique_values = df[column].dropna().unique()
    map_dict = {value: index for index, value in enumerate(unique_values)}
    return map_dict
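On a toy frame (reproducing the helper above so the snippet is self-contained), the function yields a value-to-index dictionary in which NaNs are excluded and codes follow order of first appearance:

```python
import numpy as np
import pandas as pd

def create_map_dict(df, column):
    # Same helper as above: map each non-null unique value to an integer code
    unique_values = df[column].dropna().unique()
    return {value: index for index, value in enumerate(unique_values)}

toy = pd.DataFrame({"gender": ["MALE", "FEMALE", np.nan, "MALE"]})
print(create_map_dict(toy, "gender"))  # {'MALE': 0, 'FEMALE': 1}
```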
In [94]:
X_train.head()
Out[94]:
InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss CustomerLoyaltyPeriod InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleMake VehicleModel SplitLimit CombinedSingleLimit VehicleAge DayOfWeek MonthOfIncident TimeBetweenCoverageAndIncident
24088 36 443625 MALE Masters priv-house-serv camping 0 0 194 State3 1000 1440.36 3392565 other-relative Single Vehicle Collision Side Collision Minor Damage Fire State7 City5 Location 1851 17.0 2 NO 1 0.0 NO 73318.0 9164 9164 54990 BMW X6 100 300 6579 Tuesday January 6781
4722 28 472248 FEMALE JD machine-op-inspct skydiving 0 0 126 State2 512 1510.11 0 other-relative Multi-vehicle Collision Front Collision Total Loss Police State5 City4 Location 1159 3.0 4 NO 0 1.0 NaN 68029.0 11367 11367 45295 Ford F150 100 500 1461 Thursday January 8397
6441 46 457875 FEMALE JD sales dancing 46100 0 277 State1 1764 1098.15 0 wife Single Vehicle Collision Side Collision Minor Damage Police State7 City7 Location 1324 11.0 1 NO 1 2.0 NaN 37502.0 4167 4167 29168 Nissan Maxima 250 500 2245 Tuesday February 1481
24858 38 476303 FEMALE JD sales video-games 0 -60300 189 State1 1950 1214.11 0 wife Multi-vehicle Collision Front Collision Total Loss Police State8 City1 Location 1073 23.0 3 YES 2 2.0 NO 68628.0 11510 5755 51363 Saab 95 250 300 5903 Sunday March 3237
22041 29 441726 FEMALE Associate handlers-cleaners golf 0 -66200 124 State3 510 1308.77 0 own-child Multi-vehicle Collision Side Collision Major Damage Ambulance State8 City7 Location 1443 14.0 3 NaN 0 3.0 YES 71719.0 7172 7172 57375 Audi A3 500 1000 3707 Wednesday February 7828
In [95]:
InsuredZipCode_map = create_map_dict(X_train, 'InsuredZipCode')
InsuredGender_map = create_map_dict(X_train, 'InsuredGender')
InsuredEducationLevel_map = create_map_dict(X_train, 'InsuredEducationLevel')
InsuredOccupation_map = create_map_dict(X_train, 'InsuredOccupation')
InsuredHobbies_map = create_map_dict(X_train, 'InsuredHobbies')
InsurancePolicyState_map = create_map_dict(X_train, 'InsurancePolicyState')
InsuredRelationship_map = create_map_dict(X_train, 'InsuredRelationship')
TypeOfIncident_map = create_map_dict(X_train, 'TypeOfIncident')
TypeOfCollission_map = create_map_dict(X_train, 'TypeOfCollission')
SeverityOfIncident_map = create_map_dict(X_train, 'SeverityOfIncident')
AuthoritiesContacted_map = create_map_dict(X_train, 'AuthoritiesContacted')
IncidentState_map = create_map_dict(X_train, 'IncidentState')
IncidentCity_map = create_map_dict(X_train, 'IncidentCity')
IncidentAddress_map = create_map_dict(X_train, 'IncidentAddress')
PropertyDamage_map = create_map_dict(X_train, 'PropertyDamage')
PoliceReport_map = create_map_dict(X_train, 'PoliceReport')
VehicleMake_map = create_map_dict(X_train, 'VehicleMake')
VehicleModel_map = create_map_dict(X_train, 'VehicleModel')
DayOfWeek_map = create_map_dict(X_train, 'DayOfWeek')
MonthOfIncident_map = create_map_dict(X_train, 'MonthOfIncident')
In [96]:
import pickle

# pickle.dump(InsuredZipCode_map,open('../Pickle Files/InsuredZipCode_map.pkl','wb'))
# pickle.dump(InsuredGender_map,open('../Pickle Files/InsuredGender_map.pkl','wb'))
# pickle.dump(InsuredEducationLevel_map,open('../Pickle Files/InsuredEducationLevel_map.pkl','wb'))
# pickle.dump(InsuredOccupation_map,open('../Pickle Files/InsuredOccupation_map.pkl','wb'))
# pickle.dump(InsuredHobbies_map,open('../Pickle Files/InsuredHobbies_map.pkl','wb'))
# pickle.dump(InsurancePolicyState_map,open('../Pickle Files/InsurancePolicyState_map.pkl','wb'))
# pickle.dump(InsuredRelationship_map,open('../Pickle Files/InsuredRelationship_map.pkl','wb'))
# pickle.dump(TypeOfIncident_map,open('../Pickle Files/TypeOfIncident_map.pkl','wb'))
# pickle.dump(TypeOfCollission_map,open('../Pickle Files/TypeOfCollission_map.pkl','wb'))
# pickle.dump(SeverityOfIncident_map,open('../Pickle Files/SeverityOfIncident_map.pkl','wb'))
# pickle.dump(AuthoritiesContacted_map,open('../Pickle Files/AuthoritiesContacted_map.pkl','wb'))
# pickle.dump(IncidentState_map,open('../Pickle Files/IncidentState_map.pkl','wb'))
# pickle.dump(IncidentCity_map,open('../Pickle Files/IncidentCity_map.pkl','wb'))
# pickle.dump(IncidentAddress_map,open('../Pickle Files/IncidentAddress_map.pkl','wb'))
# pickle.dump(PropertyDamage_map,open('../Pickle Files/PropertyDamage_map.pkl','wb'))
# pickle.dump(PoliceReport_map,open('../Pickle Files/PoliceReport_map.pkl','wb'))
# pickle.dump(VehicleMake_map,open('../Pickle Files/VehicleMake_map.pkl','wb'))
# pickle.dump(VehicleModel_map,open('../Pickle Files/VehicleModel_map.pkl','wb'))
# pickle.dump(DayOfWeek_map,open('../Pickle Files/DayOfWeek_map.pkl','wb'))
# pickle.dump(MonthOfIncident_map,open('../Pickle Files/MonthOfIncident_map.pkl','wb'))
In [97]:
X_train.loc[:,'InsuredZipCode'] = X_train['InsuredZipCode'].map(InsuredZipCode_map)
X_train.loc[:,'InsuredGender'] = X_train['InsuredGender'].map(InsuredGender_map)
X_train.loc[:,'InsuredEducationLevel'] = X_train['InsuredEducationLevel'].map(InsuredEducationLevel_map)
X_train.loc[:,'InsuredOccupation'] = X_train['InsuredOccupation'].map(InsuredOccupation_map)
X_train.loc[:,'InsuredHobbies'] = X_train['InsuredHobbies'].map(InsuredHobbies_map)
X_train.loc[:,'InsurancePolicyState'] = X_train['InsurancePolicyState'].map(InsurancePolicyState_map)
X_train.loc[:,'InsuredRelationship'] = X_train['InsuredRelationship'].map(InsuredRelationship_map)
X_train.loc[:,'TypeOfIncident'] = X_train['TypeOfIncident'].map(TypeOfIncident_map)
X_train.loc[:,'TypeOfCollission'] = X_train['TypeOfCollission'].map(TypeOfCollission_map)
X_train.loc[:,'SeverityOfIncident'] = X_train['SeverityOfIncident'].map(SeverityOfIncident_map)
X_train.loc[:,'AuthoritiesContacted'] = X_train['AuthoritiesContacted'].map(AuthoritiesContacted_map)
X_train.loc[:,'IncidentState'] = X_train['IncidentState'].map(IncidentState_map)
X_train.loc[:,'IncidentCity'] = X_train['IncidentCity'].map(IncidentCity_map)
X_train.loc[:,'IncidentAddress'] = X_train['IncidentAddress'].map(IncidentAddress_map)
X_train.loc[:,'PropertyDamage'] = X_train['PropertyDamage'].map(PropertyDamage_map)
X_train.loc[:,'PoliceReport'] = X_train['PoliceReport'].map(PoliceReport_map)
X_train.loc[:,'VehicleMake'] = X_train['VehicleMake'].map(VehicleMake_map)
X_train.loc[:,'VehicleModel'] = X_train['VehicleModel'].map(VehicleModel_map)
X_train.loc[:,'DayOfWeek'] = X_train['DayOfWeek'].map(DayOfWeek_map)
X_train.loc[:,'MonthOfIncident'] = X_train['MonthOfIncident'].map(MonthOfIncident_map)
/usr/share/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:1676: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
In [98]:
X_train.columns
Out[98]:
Index(['InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleMake', 'VehicleModel', 'SplitLimit', 'CombinedSingleLimit', 'VehicleAge', 'DayOfWeek', 'MonthOfIncident', 'TimeBetweenCoverageAndIncident'], dtype='object')
In [99]:
X_train.head()
Out[99]:
InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss CustomerLoyaltyPeriod InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleMake VehicleModel SplitLimit CombinedSingleLimit VehicleAge DayOfWeek MonthOfIncident TimeBetweenCoverageAndIncident
24088 36 0 0 0 0 0 0 0 194 0 1000 1440.36 3392565 0 0 0 0 0 0 0 0 17.0 2 0 1 0.0 0 73318.0 9164 9164 54990 0 0 100 300 6579 0 0 6781
4722 28 1 1 1 1 1 0 0 126 1 512 1510.11 0 0 1 1 1 1 1 1 1 3.0 4 0 0 1.0 NaN 68029.0 11367 11367 45295 1 1 100 500 1461 1 0 8397
6441 46 2 1 1 2 2 46100 0 277 2 1764 1098.15 0 1 0 0 0 1 0 2 2 11.0 1 0 1 2.0 NaN 37502.0 4167 4167 29168 2 2 250 500 2245 0 1 1481
24858 38 3 1 1 2 3 0 -60300 189 2 1950 1214.11 0 1 1 1 1 1 2 3 3 23.0 3 1 2 2.0 0 68628.0 11510 5755 51363 3 3 250 300 5903 2 2 3237
22041 29 4 1 2 3 4 0 -66200 124 0 510 1308.77 0 2 1 0 2 2 2 2 4 14.0 3 NaN 0 3.0 1 71719.0 7172 7172 57375 4 4 500 1000 3707 3 1 7828
In [100]:
from sklearn.impute import KNNImputer

cols = X_train.columns
imputer = KNNImputer(n_neighbors=1)
imputed_array = imputer.fit_transform(X_train[cols])

X_train_imp = pd.DataFrame(imputed_array, columns = cols)
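The effect of `KNNImputer` with `n_neighbors=1` can be seen on a tiny toy frame: the single closest complete row supplies the missing value (column names here are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Row 1 is missing 'y'; its nearest neighbour on the observed feature 'x'
# is row 0, so row 0's y-value fills the gap.
df = pd.DataFrame({"x": [1.0, 1.1, 8.0], "y": [10.0, np.nan, 80.0]})
imputed = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(df), columns=df.columns)
print(imputed["y"].tolist())  # [10.0, 10.0, 80.0]
```

A quick sanity check after any such imputation is to confirm that no NaNs remain, e.g. `imputed.isna().sum().sum() == 0`.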
In [101]:
# pickle.dump(imputer,open('../Pickle Files/Missing_Value_KNN_Imputer.pkl','wb'))
In [102]:
# import seaborn as sns

# columns= ['PolicyAnnualPremium', 'IncidentTime', 'Witnesses', 'AmountOfTotalClaim']
# def plot_num_dataframes_side_by_side(df1, df2, columns):
#     # Create two subplots
#     fig, axs = plt.subplots(nrows=len(columns), ncols=2, figsize=(15, 5*len(columns)))
    
#     for i, col in enumerate(columns):
#         # Draw the first plot on the left
#         sns.kdeplot(df1[col], ax=axs[i, 0], fill=True, color='teal')
        
#         # Draw the second plot on the right
#         sns.kdeplot(df2[col], ax=axs[i, 1], fill=True, color='indigo')
        
#     # Show the plot
#     plt.show()
    
# plot_num_dataframes_side_by_side(X_train, X_train_imp, columns)
In [103]:
# columns= ['InsuredGender', 'PoliceReport', 'TypeOfCollission', 'VehicleMake']
# def plot_cat_dataframes_side_by_side(df1, df2, columns):
#     # Create two subplots
#     fig, axs = plt.subplots(nrows=len(columns), ncols=2, figsize=(15, 5*len(columns)))
    
#     for i, col in enumerate(columns):
#         # Draw the first plot on the left
#         sns.countplot(df1, x = col, ax=axs[i, 0], palette="magma")
        
#         # Draw the second plot on the right
#         sns.countplot(df2, x = col, ax=axs[i, 1], palette="ocean")
        
#     # Show the plot
#     plt.show()
    
# plot_cat_dataframes_side_by_side(X_train, X_train_imp, columns)
In [104]:
columns= ['InsuredGender', 'PoliceReport', 'TypeOfCollission', 'VehicleMake']

def percent_value_counts(df1, df2, cols = columns):
    for col in cols:
        display(f"Column name: {col}")
        print("Original Data:")
        display(df1[col].value_counts(normalize=True) * 100)
        print("Imputed Data:")
        display(df2[col].value_counts(normalize=True) * 100)
        print("♦♦♦" * 13)
        
percent_value_counts(X_train, X_train_imp, columns)
'Column name: InsuredGender'
Original Data:
1    54.344807
0    45.655193
Name: InsuredGender, dtype: float64
Imputed Data:
1.0    54.342333
0.0    45.657667
Name: InsuredGender, dtype: float64
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦
'Column name: PoliceReport'
Original Data:
0    52.047705
1    47.952295
Name: PoliceReport, dtype: float64
Imputed Data:
0.0    52.905623
1.0    47.094377
Name: PoliceReport, dtype: float64
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦
'Column name: TypeOfCollission'
Original Data:
2    35.978452
0    33.545185
1    30.476363
Name: TypeOfCollission, dtype: float64
Imputed Data:
2.0    37.354471
0.0    34.951697
1.0    27.693832
Name: TypeOfCollission, dtype: float64
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦
'Column name: VehicleMake'
Original Data:
3     8.243585
2     8.050027
8     7.861432
6     7.712542
7     7.608318
1     7.489205
0     7.335352
5     7.300610
9     7.007792
13    6.829123
11    6.769567
4     6.739789
10    5.816666
12    5.235992
Name: VehicleMake, dtype: float64
Imputed Data:
3.0     8.233837
2.0     8.040624
8.0     7.862274
6.0     7.728511
7.0     7.604657
1.0     7.490711
0.0     7.337132
5.0     7.302452
9.0     7.000248
13.0    6.821897
11.0    6.772356
4.0     6.742631
10.0    5.831063
12.0    5.231608
Name: VehicleMake, dtype: float64
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦

4.1.8.2 SMOTE

In [105]:
# get column names
column_names = X_train.columns
display("Column names: ", column_names)

# get column indices
for column_name in column_names:
    column_index = X_train.columns.get_loc(column_name)
    display(f"Column '{column_name}' index: {column_index}")
'Column names: '
Index(['InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleMake', 'VehicleModel', 'SplitLimit', 'CombinedSingleLimit', 'VehicleAge', 'DayOfWeek', 'MonthOfIncident', 'TimeBetweenCoverageAndIncident'], dtype='object')
"Column 'InsuredAge' index: 0"
"Column 'InsuredZipCode' index: 1"
"Column 'InsuredGender' index: 2"
"Column 'InsuredEducationLevel' index: 3"
"Column 'InsuredOccupation' index: 4"
"Column 'InsuredHobbies' index: 5"
"Column 'CapitalGains' index: 6"
"Column 'CapitalLoss' index: 7"
"Column 'CustomerLoyaltyPeriod' index: 8"
"Column 'InsurancePolicyState' index: 9"
"Column 'Policy_Deductible' index: 10"
"Column 'PolicyAnnualPremium' index: 11"
"Column 'UmbrellaLimit' index: 12"
"Column 'InsuredRelationship' index: 13"
"Column 'TypeOfIncident' index: 14"
"Column 'TypeOfCollission' index: 15"
"Column 'SeverityOfIncident' index: 16"
"Column 'AuthoritiesContacted' index: 17"
"Column 'IncidentState' index: 18"
"Column 'IncidentCity' index: 19"
"Column 'IncidentAddress' index: 20"
"Column 'IncidentTime' index: 21"
"Column 'NumberOfVehicles' index: 22"
"Column 'PropertyDamage' index: 23"
"Column 'BodilyInjuries' index: 24"
"Column 'Witnesses' index: 25"
"Column 'PoliceReport' index: 26"
"Column 'AmountOfTotalClaim' index: 27"
"Column 'AmountOfInjuryClaim' index: 28"
"Column 'AmountOfPropertyClaim' index: 29"
"Column 'AmountOfVehicleDamage' index: 30"
"Column 'VehicleMake' index: 31"
"Column 'VehicleModel' index: 32"
"Column 'SplitLimit' index: 33"
"Column 'CombinedSingleLimit' index: 34"
"Column 'VehicleAge' index: 35"
"Column 'DayOfWeek' index: 36"
"Column 'MonthOfIncident' index: 37"
"Column 'TimeBetweenCoverageAndIncident' index: 38"
In [106]:
from imblearn.over_sampling import SMOTENC

sm = SMOTENC(random_state=42, categorical_features=[1,2,3,4,5,9,13,14,15,16,17,18,19,20,23,26,31,32,36,37])

X_train_imp, y_train = sm.fit_resample(X_train_imp, y_train)
In [107]:
#To check the distribution in the target in train and test
print("Train Distribution")
print("↓"*20)
print("Train Unique values counts")
print(y_train.value_counts())
print("Train Unique values percentages")
print(y_train.value_counts()/y_train.value_counts().sum()*100)
print("Train Total Unique values counts")
print(y_train.value_counts().sum())
print("*"*40)
print("Test Distribution")
print("↓"*20)
print("Test Unique value counts")
print(y_test.value_counts())
print("Test Unique values percentages")
print(y_test.value_counts()/y_test.value_counts().sum()*100)
print("Test Total Unique values counts")
print(y_test.value_counts().sum())
Train Distribution
↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
Train Unique values counts
N    14736
Y    14736
Name: ReportedFraud, dtype: int64
Train Unique values percentages
N    50.0
Y    50.0
Name: ReportedFraud, dtype: float64
Train Total Unique values counts
29472
****************************************
Test Distribution
↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
Test Unique value counts
N    6315
Y    2336
Name: ReportedFraud, dtype: int64
Test Unique values percentages
N    72.997341
Y    27.002659
Name: ReportedFraud, dtype: float64
Test Total Unique values counts
8651
In [108]:
X_train_imp.shape
Out[108]:
(29472, 39)
In [109]:
reverse_InsuredZipCode_map = {v: k for k, v in InsuredZipCode_map.items()}
reverse_InsuredGender_map = {v: k for k, v in InsuredGender_map.items()}
reverse_InsuredEducationLevel_map = {v: k for k, v in InsuredEducationLevel_map.items()}
reverse_InsuredOccupation_map = {v: k for k, v in InsuredOccupation_map.items()}
reverse_InsuredHobbies_map = {v: k for k, v in InsuredHobbies_map.items()}
reverse_InsurancePolicyState_map = {v: k for k, v in InsurancePolicyState_map.items()}
reverse_InsuredRelationship_map = {v: k for k, v in InsuredRelationship_map.items()}
reverse_TypeOfIncident_map = {v: k for k, v in TypeOfIncident_map.items()}
reverse_TypeOfCollission_map = {v: k for k, v in TypeOfCollission_map.items()}
reverse_SeverityOfIncident_map = {v: k for k, v in SeverityOfIncident_map.items()}
reverse_AuthoritiesContacted_map = {v: k for k, v in AuthoritiesContacted_map.items()}
reverse_IncidentState_map = {v: k for k, v in IncidentState_map.items()}
reverse_IncidentCity_map = {v: k for k, v in IncidentCity_map.items()}
reverse_IncidentAddress_map = {v: k for k, v in IncidentAddress_map.items()}
reverse_PropertyDamage_map = {v: k for k, v in PropertyDamage_map.items()}
reverse_PoliceReport_map = {v: k for k, v in PoliceReport_map.items()}
reverse_VehicleMake_map = {v: k for k, v in VehicleMake_map.items()}
reverse_VehicleModel_map = {v: k for k, v in VehicleModel_map.items()}
reverse_DayOfWeek_map = {v: k for k, v in DayOfWeek_map.items()}
reverse_MonthOfIncident_map = {v: k for k, v in MonthOfIncident_map.items()}
In [110]:
X_train_imp.loc[:,'InsuredZipCode'] = X_train_imp['InsuredZipCode'].map(reverse_InsuredZipCode_map)
X_train_imp.loc[:,'InsuredGender'] = X_train_imp['InsuredGender'].map(reverse_InsuredGender_map)
X_train_imp.loc[:,'InsuredEducationLevel'] = X_train_imp['InsuredEducationLevel'].map(reverse_InsuredEducationLevel_map)
X_train_imp.loc[:,'InsuredOccupation'] = X_train_imp['InsuredOccupation'].map(reverse_InsuredOccupation_map)
X_train_imp.loc[:,'InsuredHobbies'] = X_train_imp['InsuredHobbies'].map(reverse_InsuredHobbies_map)
X_train_imp.loc[:,'InsurancePolicyState'] = X_train_imp['InsurancePolicyState'].map(reverse_InsurancePolicyState_map)
X_train_imp.loc[:,'InsuredRelationship'] = X_train_imp['InsuredRelationship'].map(reverse_InsuredRelationship_map)
X_train_imp.loc[:,'TypeOfIncident'] = X_train_imp['TypeOfIncident'].map(reverse_TypeOfIncident_map)
X_train_imp.loc[:,'TypeOfCollission'] = X_train_imp['TypeOfCollission'].map(reverse_TypeOfCollission_map)
X_train_imp.loc[:,'SeverityOfIncident'] = X_train_imp['SeverityOfIncident'].map(reverse_SeverityOfIncident_map)
X_train_imp.loc[:,'AuthoritiesContacted'] = X_train_imp['AuthoritiesContacted'].map(reverse_AuthoritiesContacted_map)
X_train_imp.loc[:,'IncidentState'] = X_train_imp['IncidentState'].map(reverse_IncidentState_map)
X_train_imp.loc[:,'IncidentCity'] = X_train_imp['IncidentCity'].map(reverse_IncidentCity_map)
X_train_imp.loc[:,'IncidentAddress'] = X_train_imp['IncidentAddress'].map(reverse_IncidentAddress_map)
X_train_imp.loc[:,'PropertyDamage'] = X_train_imp['PropertyDamage'].map(reverse_PropertyDamage_map)
X_train_imp.loc[:,'PoliceReport'] = X_train_imp['PoliceReport'].map(reverse_PoliceReport_map)
X_train_imp.loc[:,'VehicleMake'] = X_train_imp['VehicleMake'].map(reverse_VehicleMake_map)
X_train_imp.loc[:,'VehicleModel'] = X_train_imp['VehicleModel'].map(reverse_VehicleModel_map)
X_train_imp.loc[:,'DayOfWeek'] = X_train_imp['DayOfWeek'].map(reverse_DayOfWeek_map)
X_train_imp.loc[:,'MonthOfIncident'] = X_train_imp['MonthOfIncident'].map(reverse_MonthOfIncident_map)
In [111]:
X_train_imp.head()
Out[111]:
InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss CustomerLoyaltyPeriod InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleMake VehicleModel SplitLimit CombinedSingleLimit VehicleAge DayOfWeek MonthOfIncident TimeBetweenCoverageAndIncident
0 36.0 443625 MALE Masters priv-house-serv camping 0.0 0.0 194.0 State3 1000.0 1440.36 3392565.0 other-relative Single Vehicle Collision Side Collision Minor Damage Fire State7 City5 Location 1851 17.0 2.0 NO 1.0 0.0 NO 73318.0 9164.0 9164.0 54990.0 BMW X6 100.0 300.0 6579.0 Tuesday January 6781.0
1 28.0 472248 FEMALE JD machine-op-inspct skydiving 0.0 0.0 126.0 State2 512.0 1510.11 0.0 other-relative Multi-vehicle Collision Front Collision Total Loss Police State5 City4 Location 1159 3.0 4.0 NO 0.0 1.0 NO 68029.0 11367.0 11367.0 45295.0 Ford F150 100.0 500.0 1461.0 Thursday January 8397.0
2 46.0 457875 FEMALE JD sales dancing 46100.0 0.0 277.0 State1 1764.0 1098.15 0.0 wife Single Vehicle Collision Side Collision Minor Damage Police State7 City7 Location 1324 11.0 1.0 NO 1.0 2.0 YES 37502.0 4167.0 4167.0 29168.0 Nissan Maxima 250.0 500.0 2245.0 Tuesday February 1481.0
3 38.0 476303 FEMALE JD sales video-games 0.0 -60300.0 189.0 State1 1950.0 1214.11 0.0 wife Multi-vehicle Collision Front Collision Total Loss Police State8 City1 Location 1073 23.0 3.0 YES 2.0 2.0 NO 68628.0 11510.0 5755.0 51363.0 Saab 95 250.0 300.0 5903.0 Sunday March 3237.0
4 29.0 441726 FEMALE Associate handlers-cleaners golf 0.0 -66200.0 124.0 State3 510.0 1308.77 0.0 own-child Multi-vehicle Collision Side Collision Major Damage Ambulance State8 City7 Location 1443 14.0 3.0 YES 0.0 3.0 YES 71719.0 7172.0 7172.0 57375.0 Audi A3 500.0 1000.0 3707.0 Wednesday February 7828.0
In [112]:
# categorical conversion

# listing the categorical columns in cat_cols
cat_cols = ['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation',
            'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident',
            'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState',
            'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 
             'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident']

# calling the convert_columns_types_to_category() function defined above
X_train_imp = convert_columns_types_to_category(X_train_imp, cols=cat_cols, col_type = 'category')
'### Before conversion: ###'
InsuredAge                        float64
InsuredZipCode                      int64
InsuredGender                      object
InsuredEducationLevel              object
InsuredOccupation                  object
InsuredHobbies                     object
CapitalGains                      float64
CapitalLoss                       float64
CustomerLoyaltyPeriod             float64
InsurancePolicyState               object
Policy_Deductible                 float64
PolicyAnnualPremium               float64
UmbrellaLimit                     float64
InsuredRelationship                object
TypeOfIncident                     object
TypeOfCollission                   object
SeverityOfIncident                 object
AuthoritiesContacted               object
IncidentState                      object
IncidentCity                       object
IncidentAddress                    object
IncidentTime                      float64
NumberOfVehicles                  float64
PropertyDamage                     object
BodilyInjuries                    float64
Witnesses                         float64
PoliceReport                       object
AmountOfTotalClaim                float64
AmountOfInjuryClaim               float64
AmountOfPropertyClaim             float64
AmountOfVehicleDamage             float64
VehicleMake                        object
VehicleModel                       object
SplitLimit                        float64
CombinedSingleLimit               float64
VehicleAge                        float64
DayOfWeek                          object
MonthOfIncident                    object
TimeBetweenCoverageAndIncident    float64
dtype: object
'### After conversion: ###'
InsuredAge                         float64
InsuredZipCode                    category
InsuredGender                     category
InsuredEducationLevel             category
InsuredOccupation                 category
InsuredHobbies                    category
CapitalGains                       float64
CapitalLoss                        float64
CustomerLoyaltyPeriod              float64
InsurancePolicyState              category
Policy_Deductible                  float64
PolicyAnnualPremium                float64
UmbrellaLimit                      float64
InsuredRelationship               category
TypeOfIncident                    category
TypeOfCollission                  category
SeverityOfIncident                category
AuthoritiesContacted              category
IncidentState                     category
IncidentCity                      category
IncidentAddress                   category
IncidentTime                       float64
NumberOfVehicles                   float64
PropertyDamage                    category
BodilyInjuries                     float64
Witnesses                          float64
PoliceReport                      category
AmountOfTotalClaim                 float64
AmountOfInjuryClaim                float64
AmountOfPropertyClaim              float64
AmountOfVehicleDamage              float64
VehicleMake                       category
VehicleModel                      category
SplitLimit                         float64
CombinedSingleLimit                float64
VehicleAge                         float64
DayOfWeek                         category
MonthOfIncident                   category
TimeBetweenCoverageAndIncident     float64
dtype: object
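The object-to-category conversion shown above can be illustrated on a toy frame (column names are borrowed from this dataset for familiarity; the values are made up):

```python
import pandas as pd

# Minimal illustration of the object -> category conversion above.
df = pd.DataFrame({
    "InsuredGender": ["MALE", "FEMALE", "FEMALE"],
    "InsuredAge": [36.0, 28.0, 46.0],
})
df["InsuredGender"] = df["InsuredGender"].astype("category")

print(df.dtypes)
# A category column stores small integer codes plus a lookup table of
# unique labels, which typically saves memory for low-cardinality columns.
print(df["InsuredGender"].cat.codes.tolist())
```

The saving matters most for columns like `IncidentAddress` and `VehicleModel`, where a handful of labels repeat across tens of thousands of rows.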

4.1.8.3 Validation Data

In [113]:
msno.matrix(X_test, figsize=(15, 9))
plt.show()
In [114]:
msno.heatmap(X_test,cmap="PiYG", figsize=(20,10), fontsize=13);
In [115]:
msno.dendrogram(X_test, figsize=(20,15), fontsize=15, label_rotation=90);
In [116]:
missing_percentage(X_test)
Out[116]:
Total Percentage
PropertyDamage 3163 36.56
PoliceReport 2952 34.12
TypeOfCollission 1498 17.32
PolicyAnnualPremium 42 0.49
AmountOfTotalClaim 14 0.16
VehicleMake 14 0.16
Witnesses 12 0.14
IncidentTime 9 0.10
InsuredGender 7 0.08
InsuredAge 0 0.00
AmountOfInjuryClaim 0 0.00
BodilyInjuries 0 0.00
AmountOfVehicleDamage 0 0.00
AmountOfPropertyClaim 0 0.00
VehicleModel 0 0.00
SplitLimit 0 0.00
CombinedSingleLimit 0 0.00
VehicleAge 0 0.00
DayOfWeek 0 0.00
MonthOfIncident 0 0.00
NumberOfVehicles 0 0.00
IncidentCity 0 0.00
IncidentAddress 0 0.00
InsurancePolicyState 0 0.00
InsuredEducationLevel 0 0.00
InsuredOccupation 0 0.00
InsuredHobbies 0 0.00
CapitalGains 0 0.00
CapitalLoss 0 0.00
CustomerLoyaltyPeriod 0 0.00
Policy_Deductible 0 0.00
InsuredZipCode 0 0.00
UmbrellaLimit 0 0.00
InsuredRelationship 0 0.00
TypeOfIncident 0 0.00
SeverityOfIncident 0 0.00
AuthoritiesContacted 0 0.00
IncidentState 0 0.00
TimeBetweenCoverageAndIncident 0 0.00
In [117]:
X_test.loc[:,'InsuredZipCode'] = X_test['InsuredZipCode'].map(InsuredZipCode_map)
X_test.loc[:,'InsuredGender'] = X_test['InsuredGender'].map(InsuredGender_map)
X_test.loc[:,'InsuredEducationLevel'] = X_test['InsuredEducationLevel'].map(InsuredEducationLevel_map)
X_test.loc[:,'InsuredOccupation'] = X_test['InsuredOccupation'].map(InsuredOccupation_map)
X_test.loc[:,'InsuredHobbies'] = X_test['InsuredHobbies'].map(InsuredHobbies_map)
X_test.loc[:,'InsurancePolicyState'] = X_test['InsurancePolicyState'].map(InsurancePolicyState_map)
X_test.loc[:,'InsuredRelationship'] = X_test['InsuredRelationship'].map(InsuredRelationship_map)
X_test.loc[:,'TypeOfIncident'] = X_test['TypeOfIncident'].map(TypeOfIncident_map)
X_test.loc[:,'TypeOfCollission'] = X_test['TypeOfCollission'].map(TypeOfCollission_map)
X_test.loc[:,'SeverityOfIncident'] = X_test['SeverityOfIncident'].map(SeverityOfIncident_map)
X_test.loc[:,'AuthoritiesContacted'] = X_test['AuthoritiesContacted'].map(AuthoritiesContacted_map)
X_test.loc[:,'IncidentState'] = X_test['IncidentState'].map(IncidentState_map)
X_test.loc[:,'IncidentCity'] = X_test['IncidentCity'].map(IncidentCity_map)
X_test.loc[:,'IncidentAddress'] = X_test['IncidentAddress'].map(IncidentAddress_map)
X_test.loc[:,'PropertyDamage'] = X_test['PropertyDamage'].map(PropertyDamage_map)
X_test.loc[:,'PoliceReport'] = X_test['PoliceReport'].map(PoliceReport_map)
X_test.loc[:,'VehicleMake'] = X_test['VehicleMake'].map(VehicleMake_map)
X_test.loc[:,'VehicleModel'] = X_test['VehicleModel'].map(VehicleModel_map)
X_test.loc[:,'DayOfWeek'] = X_test['DayOfWeek'].map(DayOfWeek_map)
X_test.loc[:,'MonthOfIncident'] = X_test['MonthOfIncident'].map(MonthOfIncident_map)
/usr/share/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:1676: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
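The `SettingWithCopyWarning` above appears because `X_test` is a slice of the frame produced by the earlier train/validation split, so pandas cannot tell whether the assignment should affect the parent. A minimal sketch of the safe pattern, on toy data, is to take an explicit `.copy()` before mutating:

```python
import pandas as pd

# Mutating a slice triggers SettingWithCopyWarning; an explicit copy
# removes the ambiguity and leaves the parent frame untouched.
parent = pd.DataFrame({"PoliceReport": ["YES", "NO", "YES"]})
child = parent.iloc[:2].copy()  # explicit copy: safe to mutate

child.loc[:, "PoliceReport"] = child["PoliceReport"].map({"YES": 1, "NO": 0})

print(child["PoliceReport"].tolist())
print(parent["PoliceReport"].tolist())  # parent is unchanged
```

Calling `.copy()` once, right after the split, would silence the warning for every `.map()` assignment in this section.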
In [118]:
# reuse the imputer fitted on the training data (transform only, to avoid leakage)
cols = X_test.columns
imputed_array = imputer.transform(X_test[cols])

X_test_imp = pd.DataFrame(imputed_array, columns = cols)
In [119]:
# columns= ['PolicyAnnualPremium', 'IncidentTime', 'Witnesses', 'AmountOfTotalClaim']
# plot_num_dataframes_side_by_side(X_test, X_test_imp, columns)
In [120]:
# columns= ['InsuredGender', 'PoliceReport', 'TypeOfCollission', 'VehicleMake']
# plot_cat_dataframes_side_by_side(X_test, X_test_imp, columns)
In [121]:
columns= ['InsuredGender', 'PoliceReport', 'TypeOfCollission', 'VehicleMake']
percent_value_counts(X_test, X_test_imp, columns)
'Column name: InsuredGender'
Original Data:
1    54.222582
0    45.777418
Name: InsuredGender, dtype: float64
Imputed Data:
1.0    54.190267
0.0    45.809733
Name: InsuredGender, dtype: float64
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦
'Column name: PoliceReport'
Original Data:
0    51.92139
1    48.07861
Name: PoliceReport, dtype: float64
Imputed Data:
0.0    52.710669
1.0    47.289331
Name: PoliceReport, dtype: float64
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦
'Column name: TypeOfCollission'
Original Data:
2    36.586048
0    32.503845
1    30.910108
Name: TypeOfCollission, dtype: float64
Imputed Data:
2.0    37.567911
0.0    34.157901
1.0    28.274188
Name: TypeOfCollission, dtype: float64
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦
'Column name: VehicleMake'
Original Data:
3     8.729883
8     8.440431
6     8.208869
2     7.849948
1     7.514183
7     7.421558
5     7.271043
11    6.900544
0     6.888966
4     6.877388
13    6.599514
9     6.587936
10    5.638532
12    5.071205
Name: VehicleMake, dtype: float64
Imputed Data:
3.0     8.727315
8.0     8.438331
6.0     8.218703
2.0     7.848804
1.0     7.513582
7.0     7.409548
5.0     7.270836
4.0     6.889377
11.0    6.889377
0.0     6.877818
9.0     6.600393
13.0    6.600393
10.0    5.640966
12.0    5.074558
Name: VehicleMake, dtype: float64
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦
In [122]:
X_test_imp.loc[:,'InsuredZipCode'] = X_test_imp['InsuredZipCode'].map(reverse_InsuredZipCode_map)
X_test_imp.loc[:,'InsuredGender'] = X_test_imp['InsuredGender'].map(reverse_InsuredGender_map)
X_test_imp.loc[:,'InsuredEducationLevel'] = X_test_imp['InsuredEducationLevel'].map(reverse_InsuredEducationLevel_map)
X_test_imp.loc[:,'InsuredOccupation'] = X_test_imp['InsuredOccupation'].map(reverse_InsuredOccupation_map)
X_test_imp.loc[:,'InsuredHobbies'] = X_test_imp['InsuredHobbies'].map(reverse_InsuredHobbies_map)
X_test_imp.loc[:,'InsurancePolicyState'] = X_test_imp['InsurancePolicyState'].map(reverse_InsurancePolicyState_map)
X_test_imp.loc[:,'InsuredRelationship'] = X_test_imp['InsuredRelationship'].map(reverse_InsuredRelationship_map)
X_test_imp.loc[:,'TypeOfIncident'] = X_test_imp['TypeOfIncident'].map(reverse_TypeOfIncident_map)
X_test_imp.loc[:,'TypeOfCollission'] = X_test_imp['TypeOfCollission'].map(reverse_TypeOfCollission_map)
X_test_imp.loc[:,'SeverityOfIncident'] = X_test_imp['SeverityOfIncident'].map(reverse_SeverityOfIncident_map)
X_test_imp.loc[:,'AuthoritiesContacted'] = X_test_imp['AuthoritiesContacted'].map(reverse_AuthoritiesContacted_map)
X_test_imp.loc[:,'IncidentState'] = X_test_imp['IncidentState'].map(reverse_IncidentState_map)
X_test_imp.loc[:,'IncidentCity'] = X_test_imp['IncidentCity'].map(reverse_IncidentCity_map)
X_test_imp.loc[:,'IncidentAddress'] = X_test_imp['IncidentAddress'].map(reverse_IncidentAddress_map)
X_test_imp.loc[:,'PropertyDamage'] = X_test_imp['PropertyDamage'].map(reverse_PropertyDamage_map)
X_test_imp.loc[:,'PoliceReport'] = X_test_imp['PoliceReport'].map(reverse_PoliceReport_map)
X_test_imp.loc[:,'VehicleMake'] = X_test_imp['VehicleMake'].map(reverse_VehicleMake_map)
X_test_imp.loc[:,'VehicleModel'] = X_test_imp['VehicleModel'].map(reverse_VehicleModel_map)
X_test_imp.loc[:,'DayOfWeek'] = X_test_imp['DayOfWeek'].map(reverse_DayOfWeek_map)
X_test_imp.loc[:,'MonthOfIncident'] = X_test_imp['MonthOfIncident'].map(reverse_MonthOfIncident_map)
In [123]:
X_test_imp.head()
Out[123]:
InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss CustomerLoyaltyPeriod InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleMake VehicleModel SplitLimit CombinedSingleLimit VehicleAge DayOfWeek MonthOfIncident TimeBetweenCoverageAndIncident
0 41.0 605743 MALE JD tech-support paintball 0.0 0.0 244.0 State2 500.0 1105.67 0.0 other-relative Vehicle Theft Rear Collision Minor Damage None State8 City1 Location 1759 8.0 1.0 YES 1.0 2.0 YES 7013.0 1169.0 584.0 5260.0 Accura Pathfinder 250.0 500.0 2570.0 Wednesday January 4415.0
1 26.0 600313 FEMALE MD priv-house-serv paintball 0.0 0.0 100.0 State1 1000.0 1601.50 0.0 husband Multi-vehicle Collision Rear Collision Total Loss Ambulance State9 City6 Location 1419 7.0 3.0 NO 1.0 1.0 NO 72236.0 13927.0 6530.0 51779.0 Accura MDX 250.0 500.0 6257.0 Wednesday February 1716.0
2 35.0 434247 MALE High School exec-managerial kayaking 31900.0 -44600.0 130.0 State2 500.0 1123.82 0.0 own-child Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 1254 19.0 4.0 YES 1.0 2.0 YES 58405.0 7525.0 7525.0 43355.0 Volkswagen Jetta 250.0 500.0 2568.0 Monday January 7216.0
3 40.0 459984 MALE Masters armed-forces skydiving 50000.0 -56900.0 191.0 State2 929.0 1002.83 0.0 husband Single Vehicle Collision Front Collision Total Loss Other State7 City4 Location 1281 11.0 1.0 YES 0.0 2.0 NO 73103.0 7964.0 14060.0 51079.0 Chevrolet A3 500.0 1000.0 2967.0 Sunday February 1526.0
4 37.0 438215 FEMALE High School transport-moving basketball 52300.0 0.0 153.0 State3 590.0 994.37 0.0 not-in-family Multi-vehicle Collision Front Collision Minor Damage Ambulance State5 City5 Location 1262 4.0 3.0 NO 0.0 3.0 NO 43552.0 4934.0 4934.0 33684.0 Nissan Jetta 250.0 500.0 5859.0 Friday January 4116.0
In [124]:
# categorical conversion

# listing the categorical columns in cat_cols
cat_cols = ['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation',
            'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident',
            'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState',
            'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 
             'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident']

# calling the convert_columns_types_to_category() function defined above
X_test_imp = convert_columns_types_to_category(X_test_imp, cols=cat_cols, col_type = 'category')
'### Before conversion: ###'
InsuredAge                        float64
InsuredZipCode                      int64
InsuredGender                      object
InsuredEducationLevel              object
InsuredOccupation                  object
InsuredHobbies                     object
CapitalGains                      float64
CapitalLoss                       float64
CustomerLoyaltyPeriod             float64
InsurancePolicyState               object
Policy_Deductible                 float64
PolicyAnnualPremium               float64
UmbrellaLimit                     float64
InsuredRelationship                object
TypeOfIncident                     object
TypeOfCollission                   object
SeverityOfIncident                 object
AuthoritiesContacted               object
IncidentState                      object
IncidentCity                       object
IncidentAddress                    object
IncidentTime                      float64
NumberOfVehicles                  float64
PropertyDamage                     object
BodilyInjuries                    float64
Witnesses                         float64
PoliceReport                       object
AmountOfTotalClaim                float64
AmountOfInjuryClaim               float64
AmountOfPropertyClaim             float64
AmountOfVehicleDamage             float64
VehicleMake                        object
VehicleModel                       object
SplitLimit                        float64
CombinedSingleLimit               float64
VehicleAge                        float64
DayOfWeek                          object
MonthOfIncident                    object
TimeBetweenCoverageAndIncident    float64
dtype: object
'### After conversion: ###'
InsuredAge                         float64
InsuredZipCode                    category
InsuredGender                     category
InsuredEducationLevel             category
InsuredOccupation                 category
InsuredHobbies                    category
CapitalGains                       float64
CapitalLoss                        float64
CustomerLoyaltyPeriod              float64
InsurancePolicyState              category
Policy_Deductible                  float64
PolicyAnnualPremium                float64
UmbrellaLimit                      float64
InsuredRelationship               category
TypeOfIncident                    category
TypeOfCollission                  category
SeverityOfIncident                category
AuthoritiesContacted              category
IncidentState                     category
IncidentCity                      category
IncidentAddress                   category
IncidentTime                       float64
NumberOfVehicles                   float64
PropertyDamage                    category
BodilyInjuries                     float64
Witnesses                          float64
PoliceReport                      category
AmountOfTotalClaim                 float64
AmountOfInjuryClaim                float64
AmountOfPropertyClaim              float64
AmountOfVehicleDamage              float64
VehicleMake                       category
VehicleModel                      category
SplitLimit                         float64
CombinedSingleLimit                float64
VehicleAge                         float64
DayOfWeek                         category
MonthOfIncident                   category
TimeBetweenCoverageAndIncident     float64
dtype: object

4.1.8.4 Test Data

In [125]:
msno.matrix(test_df, figsize=(15, 9))
plt.show()
In [126]:
msno.heatmap(test_df,cmap="PiYG", figsize=(20,10), fontsize=13);
In [127]:
msno.dendrogram(test_df, figsize=(20,15), fontsize=15, label_rotation=90);
In [128]:
missing_percentage(test_df)
Out[128]:
Total Percentage
PropertyDamage 3199 35.90
PoliceReport 3014 33.82
TypeOfCollission 1763 19.78
PolicyAnnualPremium 47 0.53
Witnesses 12 0.13
InsuredGender 8 0.09
AmountOfTotalClaim 8 0.09
VehicleMake 8 0.09
IncidentTime 7 0.08
InsuredAge 0 0.00
AmountOfInjuryClaim 0 0.00
BodilyInjuries 0 0.00
AmountOfVehicleDamage 0 0.00
AmountOfPropertyClaim 0 0.00
VehicleModel 0 0.00
SplitLimit 0 0.00
CombinedSingleLimit 0 0.00
VehicleAge 0 0.00
DayOfWeek 0 0.00
MonthOfIncident 0 0.00
NumberOfVehicles 0 0.00
IncidentCity 0 0.00
IncidentAddress 0 0.00
InsurancePolicyState 0 0.00
InsuredEducationLevel 0 0.00
InsuredOccupation 0 0.00
InsuredHobbies 0 0.00
CapitalGains 0 0.00
CapitalLoss 0 0.00
CustomerLoyaltyPeriod 0 0.00
Policy_Deductible 0 0.00
InsuredZipCode 0 0.00
UmbrellaLimit 0 0.00
InsuredRelationship 0 0.00
TypeOfIncident 0 0.00
SeverityOfIncident 0 0.00
AuthoritiesContacted 0 0.00
IncidentState 0 0.00
TimeBetweenCoverageAndIncident 0 0.00
In [129]:
test_df.loc[:,'InsuredZipCode'] = test_df['InsuredZipCode'].map(InsuredZipCode_map)
test_df.loc[:,'InsuredGender'] = test_df['InsuredGender'].map(InsuredGender_map)
test_df.loc[:,'InsuredEducationLevel'] = test_df['InsuredEducationLevel'].map(InsuredEducationLevel_map)
test_df.loc[:,'InsuredOccupation'] = test_df['InsuredOccupation'].map(InsuredOccupation_map)
test_df.loc[:,'InsuredHobbies'] = test_df['InsuredHobbies'].map(InsuredHobbies_map)
test_df.loc[:,'InsurancePolicyState'] = test_df['InsurancePolicyState'].map(InsurancePolicyState_map)
test_df.loc[:,'InsuredRelationship'] = test_df['InsuredRelationship'].map(InsuredRelationship_map)
test_df.loc[:,'TypeOfIncident'] = test_df['TypeOfIncident'].map(TypeOfIncident_map)
test_df.loc[:,'TypeOfCollission'] = test_df['TypeOfCollission'].map(TypeOfCollission_map)
test_df.loc[:,'SeverityOfIncident'] = test_df['SeverityOfIncident'].map(SeverityOfIncident_map)
test_df.loc[:,'AuthoritiesContacted'] = test_df['AuthoritiesContacted'].map(AuthoritiesContacted_map)
test_df.loc[:,'IncidentState'] = test_df['IncidentState'].map(IncidentState_map)
test_df.loc[:,'IncidentCity'] = test_df['IncidentCity'].map(IncidentCity_map)
test_df.loc[:,'IncidentAddress'] = test_df['IncidentAddress'].map(IncidentAddress_map)
test_df.loc[:,'PropertyDamage'] = test_df['PropertyDamage'].map(PropertyDamage_map)
test_df.loc[:,'PoliceReport'] = test_df['PoliceReport'].map(PoliceReport_map)
test_df.loc[:,'VehicleMake'] = test_df['VehicleMake'].map(VehicleMake_map)
test_df.loc[:,'VehicleModel'] = test_df['VehicleModel'].map(VehicleModel_map)
test_df.loc[:,'DayOfWeek'] = test_df['DayOfWeek'].map(DayOfWeek_map)
test_df.loc[:,'MonthOfIncident'] = test_df['MonthOfIncident'].map(MonthOfIncident_map)
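As an aside, the twenty near-identical `.map()` calls above could be driven by a single dictionary of `{column: mapping}` pairs. A sketch with two toy mappings (the real `*_map` dictionaries are defined earlier in the notebook):

```python
import pandas as pd

# One dict of {column_name: mapping} replaces a run of repetitive .map() lines.
column_maps = {
    "InsuredGender": {"MALE": 0, "FEMALE": 1},
    "PropertyDamage": {"NO": 0, "YES": 1},
}

df = pd.DataFrame({"InsuredGender": ["MALE", "FEMALE"],
                   "PropertyDamage": ["YES", "NO"]})

for col, mapping in column_maps.items():
    df.loc[:, col] = df[col].map(mapping)

print(df["InsuredGender"].tolist())
print(df["PropertyDamage"].tolist())
```

The same loop, pointed at the reverse maps, would also collapse the decoding cells that follow each imputation step.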
In [130]:
# reuse the imputer fitted on the training data (transform only, to avoid leakage)
cols = test_df.columns
imputed_array = imputer.transform(test_df[cols])

test_df_imp = pd.DataFrame(imputed_array, columns = cols)
In [131]:
test_df_imp.isnull().sum()
Out[131]:
InsuredAge                        0
InsuredZipCode                    0
InsuredGender                     0
InsuredEducationLevel             0
InsuredOccupation                 0
InsuredHobbies                    0
CapitalGains                      0
CapitalLoss                       0
CustomerLoyaltyPeriod             0
InsurancePolicyState              0
Policy_Deductible                 0
PolicyAnnualPremium               0
UmbrellaLimit                     0
InsuredRelationship               0
TypeOfIncident                    0
TypeOfCollission                  0
SeverityOfIncident                0
AuthoritiesContacted              0
IncidentState                     0
IncidentCity                      0
IncidentAddress                   0
IncidentTime                      0
NumberOfVehicles                  0
PropertyDamage                    0
BodilyInjuries                    0
Witnesses                         0
PoliceReport                      0
AmountOfTotalClaim                0
AmountOfInjuryClaim               0
AmountOfPropertyClaim             0
AmountOfVehicleDamage             0
VehicleMake                       0
VehicleModel                      0
SplitLimit                        0
CombinedSingleLimit               0
VehicleAge                        0
DayOfWeek                         0
MonthOfIncident                   0
TimeBetweenCoverageAndIncident    0
dtype: int64
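The column-by-column `isnull().sum()` check above can be condensed into a single assertion, a convenient guard to place after any imputation step (sketched here on toy data):

```python
import pandas as pd

# Compact post-imputation completeness check: fail loudly if any NaN survives.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

assert df.isnull().sum().sum() == 0, "imputation left missing values"
print("no missing values")
```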
In [132]:
test_df_imp.loc[:,'InsuredZipCode'] = test_df_imp['InsuredZipCode'].map(reverse_InsuredZipCode_map)
test_df_imp.loc[:,'InsuredGender'] = test_df_imp['InsuredGender'].map(reverse_InsuredGender_map)
test_df_imp.loc[:,'InsuredEducationLevel'] = test_df_imp['InsuredEducationLevel'].map(reverse_InsuredEducationLevel_map)
test_df_imp.loc[:,'InsuredOccupation'] = test_df_imp['InsuredOccupation'].map(reverse_InsuredOccupation_map)
test_df_imp.loc[:,'InsuredHobbies'] = test_df_imp['InsuredHobbies'].map(reverse_InsuredHobbies_map)
test_df_imp.loc[:,'InsurancePolicyState'] = test_df_imp['InsurancePolicyState'].map(reverse_InsurancePolicyState_map)
test_df_imp.loc[:,'InsuredRelationship'] = test_df_imp['InsuredRelationship'].map(reverse_InsuredRelationship_map)
test_df_imp.loc[:,'TypeOfIncident'] = test_df_imp['TypeOfIncident'].map(reverse_TypeOfIncident_map)
test_df_imp.loc[:,'TypeOfCollission'] = test_df_imp['TypeOfCollission'].map(reverse_TypeOfCollission_map)
test_df_imp.loc[:,'SeverityOfIncident'] = test_df_imp['SeverityOfIncident'].map(reverse_SeverityOfIncident_map)
test_df_imp.loc[:,'AuthoritiesContacted'] = test_df_imp['AuthoritiesContacted'].map(reverse_AuthoritiesContacted_map)
test_df_imp.loc[:,'IncidentState'] = test_df_imp['IncidentState'].map(reverse_IncidentState_map)
test_df_imp.loc[:,'IncidentCity'] = test_df_imp['IncidentCity'].map(reverse_IncidentCity_map)
test_df_imp.loc[:,'IncidentAddress'] = test_df_imp['IncidentAddress'].map(reverse_IncidentAddress_map)
test_df_imp.loc[:,'PropertyDamage'] = test_df_imp['PropertyDamage'].map(reverse_PropertyDamage_map)
test_df_imp.loc[:,'PoliceReport'] = test_df_imp['PoliceReport'].map(reverse_PoliceReport_map)
test_df_imp.loc[:,'VehicleMake'] = test_df_imp['VehicleMake'].map(reverse_VehicleMake_map)
test_df_imp.loc[:,'VehicleModel'] = test_df_imp['VehicleModel'].map(reverse_VehicleModel_map)
test_df_imp.loc[:,'DayOfWeek'] = test_df_imp['DayOfWeek'].map(reverse_DayOfWeek_map)
test_df_imp.loc[:,'MonthOfIncident'] = test_df_imp['MonthOfIncident'].map(reverse_MonthOfIncident_map)
In [133]:
test_df_imp.head()
Out[133]:
InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss CustomerLoyaltyPeriod InsurancePolicyState Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleMake VehicleModel SplitLimit CombinedSingleLimit VehicleAge DayOfWeek MonthOfIncident TimeBetweenCoverageAndIncident
0 27.0 471704 FEMALE High School adm-clerical base-jumping 56400.0 -57000.0 84.0 State2 2000.0 1006.00 0.0 own-child Multi-vehicle Collision Front Collision Minor Damage Ambulance State5 City2 Location 1354 4.0 3.0 NO 0.0 0.0 NO 68354.0 6835.0 8059.0 53460.0 Volkswagen Passat 500.0 1000.0 7340.0 Thursday February 6115.0
1 40.0 455810 FEMALE MD prof-specialty golf 56700.0 -65600.0 232.0 State3 500.0 1279.17 0.0 unmarried Single Vehicle Collision Rear Collision Minor Damage Fire State9 City5 Location 1383 16.0 1.0 NO 1.0 1.0 YES 55270.0 8113.0 5240.0 41917.0 Nissan Ultima 100.0 300.0 3299.0 Tuesday January 1160.0
2 39.0 461919 MALE JD other-service movies 30400.0 0.0 218.0 State2 1000.0 1454.67 1235986.0 other-relative Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 2030 20.0 3.0 NO 0.0 1.0 NO 59515.0 7490.0 9110.0 42915.0 Suburu Impreza 250.0 500.0 1830.0 Monday January 1633.0
3 38.0 600904 FEMALE Masters exec-managerial video-games 68500.0 0.0 205.0 State3 2000.0 1287.76 5873212.0 wife Vehicle Theft Rear Collision Trivial Damage None State7 City5 Location 1449 10.0 1.0 NO 2.0 1.0 NO 4941.0 494.0 866.0 3581.0 Accura TL 500.0 500.0 2193.0 Saturday January 5228.0
4 29.0 430632 FEMALE PhD sales board-games 35100.0 0.0 134.0 State3 2000.0 1413.14 5000000.0 own-child Multi-vehicle Collision Rear Collision Minor Damage Police State5 City2 Location 1916 7.0 3.0 NO 2.0 3.0 NO 34650.0 7700.0 3850.0 23100.0 Dodge RAM 100.0 300.0 2974.0 Sunday February 5282.0
In [134]:
# categorical conversion

# listing the categorical columns in cat_cols
cat_cols = ['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation',
            'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident',
            'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState',
            'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 
             'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident']

# calling the convert_columns_types_to_category() function defined above
test_df_imp = convert_columns_types_to_category(test_df_imp, cols=cat_cols, col_type = 'category')
'### Before conversion: ###'
InsuredAge                        float64
InsuredZipCode                      int64
InsuredGender                      object
InsuredEducationLevel              object
InsuredOccupation                  object
InsuredHobbies                     object
CapitalGains                      float64
CapitalLoss                       float64
CustomerLoyaltyPeriod             float64
InsurancePolicyState               object
Policy_Deductible                 float64
PolicyAnnualPremium               float64
UmbrellaLimit                     float64
InsuredRelationship                object
TypeOfIncident                     object
TypeOfCollission                   object
SeverityOfIncident                 object
AuthoritiesContacted               object
IncidentState                      object
IncidentCity                       object
IncidentAddress                    object
IncidentTime                      float64
NumberOfVehicles                  float64
PropertyDamage                     object
BodilyInjuries                    float64
Witnesses                         float64
PoliceReport                       object
AmountOfTotalClaim                float64
AmountOfInjuryClaim               float64
AmountOfPropertyClaim             float64
AmountOfVehicleDamage             float64
VehicleMake                        object
VehicleModel                       object
SplitLimit                        float64
CombinedSingleLimit               float64
VehicleAge                        float64
DayOfWeek                          object
MonthOfIncident                    object
TimeBetweenCoverageAndIncident    float64
dtype: object
'### After conversion: ###'
InsuredAge                         float64
InsuredZipCode                    category
InsuredGender                     category
InsuredEducationLevel             category
InsuredOccupation                 category
InsuredHobbies                    category
CapitalGains                       float64
CapitalLoss                        float64
CustomerLoyaltyPeriod              float64
InsurancePolicyState              category
Policy_Deductible                  float64
PolicyAnnualPremium                float64
UmbrellaLimit                      float64
InsuredRelationship               category
TypeOfIncident                    category
TypeOfCollission                  category
SeverityOfIncident                category
AuthoritiesContacted              category
IncidentState                     category
IncidentCity                      category
IncidentAddress                   category
IncidentTime                       float64
NumberOfVehicles                   float64
PropertyDamage                    category
BodilyInjuries                     float64
Witnesses                          float64
PoliceReport                      category
AmountOfTotalClaim                 float64
AmountOfInjuryClaim                float64
AmountOfPropertyClaim              float64
AmountOfVehicleDamage              float64
VehicleMake                       category
VehicleModel                      category
SplitLimit                         float64
CombinedSingleLimit                float64
VehicleAge                         float64
DayOfWeek                         category
MonthOfIncident                   category
TimeBetweenCoverageAndIncident     float64
dtype: object

4.1.9 Splitting Data into Categorical and Numerical DataFrames

In this step, we split the data into numeric and categorical DataFrames so that type-specific operations can be applied to each.

4.1.9.1 Train Data

In [135]:
def get_num_cat_dataframes(DataFrame):
    num_df = DataFrame.select_dtypes(include=['int','float']) # Assigning the columns which are of 'int' & 'float' type to num_df
    print(num_df.shape)
    cat_df = DataFrame.select_dtypes(include=['category']) # Assigning the columns which are of 'category' type to cat_df
    print(cat_df.shape)
    return num_df, cat_df

X_train_num, X_train_cat = get_num_cat_dataframes(X_train_imp)
(29472, 19)
(29472, 20)
In [136]:
X_train_num.columns
Out[136]:
Index(['InsuredAge', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'IncidentTime', 'NumberOfVehicles', 'BodilyInjuries', 'Witnesses', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'SplitLimit', 'CombinedSingleLimit', 'VehicleAge', 'TimeBetweenCoverageAndIncident'], dtype='object')
In [137]:
X_train_cat.columns
Out[137]:
Index(['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident'], dtype='object')

4.1.9.2 Validation Data

In [138]:
X_test_num, X_test_cat = get_num_cat_dataframes(X_test_imp)
(8651, 19)
(8651, 20)
In [139]:
X_test_num.columns
Out[139]:
Index(['InsuredAge', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'IncidentTime', 'NumberOfVehicles', 'BodilyInjuries', 'Witnesses', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'SplitLimit', 'CombinedSingleLimit', 'VehicleAge', 'TimeBetweenCoverageAndIncident'], dtype='object')
In [140]:
X_test_cat.columns
Out[140]:
Index(['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident'], dtype='object')

4.1.9.3 Test Data

In [141]:
test_df_num, test_df_cat = get_num_cat_dataframes(test_df_imp)
(8912, 19)
(8912, 20)
In [142]:
test_df_num.columns
Out[142]:
Index(['InsuredAge', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'IncidentTime', 'NumberOfVehicles', 'BodilyInjuries', 'Witnesses', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'SplitLimit', 'CombinedSingleLimit', 'VehicleAge', 'TimeBetweenCoverageAndIncident'], dtype='object')
In [143]:
test_df_cat.columns
Out[143]:
Index(['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident'], dtype='object')

4.1.10 Standardization of Numeric DataFrames

Standardization of numeric data frames is performed to bring all the features to the same scale or range. Standardization makes it easier to compare the features and identify the most important ones in the dataset. It also helps in improving the performance of certain machine learning algorithms that are sensitive to the scale of the features.

When we have features with different scales, some machine learning algorithms may give more weight to features with larger values, even if those features are not necessarily more important than the others. Standardization rescales the features so that they have a mean of zero and a standard deviation of one. This way, all the features are given equal importance in the algorithm.

Note that standardization itself is sensitive to outliers: extreme values inflate the standard deviation and shift the mean, which compresses the rest of the data after scaling. When outliers dominate a feature, robust alternatives that scale by the median and interquartile range are preferable.

Overall, standardization is an important preprocessing step that can help improve the accuracy and reliability of machine learning models.
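The transform described above is the z-score, z = (x − μ) / σ, where μ and σ are the mean and standard deviation learned from the training data. A minimal sketch with NumPy on invented values (the array below is made up purely for illustration):

```python
import numpy as np

# Hypothetical feature column, for demonstration only
x = np.array([2.0, 4.0, 6.0, 8.0])

# z-score standardization: subtract the mean, divide by the standard deviation
mu, sigma = x.mean(), x.std()
z = (x - mu) / sigma

print(z.mean())  # ≈ 0.0
print(z.std())   # ≈ 1.0
```

`StandardScaler` does the same computation column-wise, remembering μ and σ from `fit` so the identical transform can be reapplied to validation and test data.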

4.1.10.1 Train Data & Validation Data

In [144]:
from sklearn.preprocessing import StandardScaler

def perform_standardization(X_train, X_test):
    scaler = StandardScaler()
    num_attr = X_train.select_dtypes(['int', 'float']).columns
    print(num_attr)
    # Fit on the training data only, then apply the same transform to both
    # splits so no information from the validation set leaks into the scaler.
    scaler.fit(X_train[num_attr])
    # .loc assignment avoids pandas' SettingWithCopyWarning on chained indexing
    X_train.loc[:, num_attr] = scaler.transform(X_train[num_attr])
    X_test.loc[:, num_attr] = scaler.transform(X_test[num_attr])
    return scaler
    
scaler = perform_standardization(X_train_num,X_test_num)
Index(['InsuredAge', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'IncidentTime', 'NumberOfVehicles', 'BodilyInjuries', 'Witnesses', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'SplitLimit', 'CombinedSingleLimit', 'VehicleAge', 'TimeBetweenCoverageAndIncident'], dtype='object')
In [145]:
# pickle.dump(scaler,open('../Pickle Files/StandardScaler.pkl','wb'))

4.1.10.2 Test Data

In [146]:
def perform_standardization_one_df(test_df, scaler):
    num_attr = test_df.select_dtypes(['int', 'float']).columns
    print(num_attr)
    # Reuse the scaler fitted on the training data;
    # .loc assignment avoids SettingWithCopyWarning
    test_df.loc[:, num_attr] = scaler.transform(test_df[num_attr])
In [147]:
perform_standardization_one_df(test_df_num, scaler)
Index(['InsuredAge', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'IncidentTime', 'NumberOfVehicles', 'BodilyInjuries', 'Witnesses', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'SplitLimit', 'CombinedSingleLimit', 'VehicleAge', 'TimeBetweenCoverageAndIncident'], dtype='object')

4.1.11 Categorical Encoding

Categorical encoding is the process of converting categorical data, which consists of values that represent specific categories, into numerical data that can be used as input for machine learning models. There are several different methods for categorical encoding, including:

1. One-Hot Encoding:

  • In this method, each unique category value is converted into a binary vector with a length equal to the total number of unique categories. Each binary vector has a single 1-value in the position corresponding to the index of the category, and 0-values elsewhere. This creates a sparse matrix, which can be memory-intensive for large datasets with many unique categories.

2. Label Encoding:

  • In this method, each unique category value is assigned a numerical label, with the labels ranging from 0 to the total number of unique categories minus 1. This creates an ordinal variable, which can be problematic for categorical data with no inherent order.

3. Ordinal Encoding:

  • This is similar to label encoding, but instead of assigning labels arbitrarily, they are assigned based on the ordinal relationship between the categories. For example, in a dataset with categories "low", "medium", and "high", the labels could be assigned as 0, 1, and 2, respectively. This method can help preserve the relative order of the categories, but it assumes that the distance between the categories is equal, which may not always be true.

4. Target Encoding:

  • In this method, each unique category value is replaced with the mean of the target variable for that category. This can be useful for creating features that capture the relationship between the categorical variable and the target variable, but it can also lead to overfitting if the target variable is highly correlated with the categorical variable.
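The four methods can be contrasted on a tiny example. The column values and target below are invented purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({'size': ['low', 'high', 'medium', 'low'],
                   'fraud': [0, 1, 1, 0]})

# One-hot: one binary column per category; drop_first mirrors drop='first'
# in sklearn's OneHotEncoder (the dropped level becomes the reference).
one_hot = pd.get_dummies(df['size'], prefix='size', drop_first=True)

# Ordinal: an explicit mapping that preserves the low < medium < high order
ordinal = df['size'].map({'low': 0, 'medium': 1, 'high': 2})

# Target: replace each category with the mean of the target within it
target = df['size'].map(df.groupby('size')['fraud'].mean())

print(list(one_hot.columns))  # ['size_low', 'size_medium']
print(ordinal.tolist())       # [0, 2, 1, 0]
print(target.tolist())        # [0.0, 1.0, 1.0, 0.0]
```

In this notebook, one-hot encoding is used for the low-cardinality nominal columns; high-cardinality columns such as `InsuredZipCode` and `IncidentAddress` (about 1,000 levels each, per the value counts below) would blow up the feature space under one-hot encoding and call for one of the other schemes.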
In [148]:
for col in X_train_cat.columns:
    col_data = X_train_cat[col]
    unique_counts = col_data.value_counts()
    print(f"Column name: {col}")
    print(f"Data type: {col_data.dtype}")
    print(f"Value counts:")
    print(unique_counts)
    print("-------------------------")
Column name: InsuredZipCode
Data type: category
Value counts:
476198    176
608331    148
478456    131
474801    123
446895    123
         ... 
605220      7
441714      7
470610      7
603269      6
448984      5
Name: InsuredZipCode, Length: 995, dtype: int64
-------------------------
Column name: InsuredGender
Data type: category
Value counts:
FEMALE    15924
MALE      13548
Name: InsuredGender, dtype: int64
-------------------------
Column name: InsuredEducationLevel
Data type: category
Value counts:
JD             4918
High School    4652
MD             4203
Masters        4166
Associate      4134
PhD            3793
College        3606
Name: InsuredEducationLevel, dtype: int64
-------------------------
Column name: InsuredOccupation
Data type: category
Value counts:
machine-op-inspct    2751
exec-managerial      2526
tech-support         2506
prof-specialty       2435
transport-moving     2316
sales                2141
craft-repair         2118
priv-house-serv      2072
armed-forces         2068
farming-fishing      1882
other-service        1796
protective-serv      1749
adm-clerical         1688
handlers-cleaners    1424
Name: InsuredOccupation, dtype: int64
-------------------------
Column name: InsuredHobbies
Data type: category
Value counts:
cross-fit         2278
chess             2133
paintball         1644
reading           1597
board-games       1575
bungie-jumping    1557
yachting          1556
polo              1542
base-jumping      1533
camping           1475
exercise          1453
video-games       1436
movies            1431
hiking            1373
kayaking          1370
golf              1253
skydiving         1247
sleeping          1169
dancing            947
basketball         903
Name: InsuredHobbies, dtype: int64
-------------------------
Column name: InsurancePolicyState
Data type: category
Value counts:
State3    10865
State1     9509
State2     9098
Name: InsurancePolicyState, dtype: int64
-------------------------
Column name: InsuredRelationship
Data type: category
Value counts:
other-relative    5879
not-in-family     5294
own-child         4945
husband           4819
unmarried         4465
wife              4070
Name: InsuredRelationship, dtype: int64
-------------------------
Column name: TypeOfIncident
Data type: category
Value counts:
Multi-vehicle Collision     12635
Single Vehicle Collision    12565
Vehicle Theft                2240
Parked Car                   2032
Name: TypeOfIncident, dtype: int64
-------------------------
Column name: TypeOfCollission
Data type: category
Value counts:
Rear Collision     11459
Side Collision      9944
Front Collision     8069
Name: TypeOfCollission, dtype: int64
-------------------------
Column name: SeverityOfIncident
Data type: category
Value counts:
Major Damage      11874
Minor Damage       8737
Total Loss         6761
Trivial Damage     2100
Name: SeverityOfIncident, dtype: int64
-------------------------
Column name: AuthoritiesContacted
Data type: category
Value counts:
Police       8172
Fire         6937
Other        6156
Ambulance    6152
None         2055
Name: AuthoritiesContacted, dtype: int64
-------------------------
Column name: IncidentState
Data type: category
Value counts:
State7    8340
State5    7632
State9    5632
State4    3372
State8    2971
State6     853
State3     672
Name: IncidentState, dtype: int64
-------------------------
Column name: IncidentCity
Data type: category
Value counts:
City1    4722
City2    4596
City4    4458
City7    4372
City3    4103
City5    3770
City6    3451
Name: IncidentCity, dtype: int64
-------------------------
Column name: IncidentAddress
Data type: category
Value counts:
Location 1393    166
Location 1183    164
Location 1254    148
Location 1192    143
Location 1746    139
                ... 
Location 1628      7
Location 1449      6
Location 1709      6
Location 1359      5
Location 2072      5
Name: IncidentAddress, Length: 1000, dtype: int64
-------------------------
Column name: PropertyDamage
Data type: category
Value counts:
NO     15416
YES    14056
Name: PropertyDamage, dtype: int64
-------------------------
Column name: PoliceReport
Data type: category
Value counts:
NO     15723
YES    13749
Name: PoliceReport, dtype: int64
-------------------------
Column name: VehicleMake
Data type: category
Value counts:
BMW           2425
Ford          2389
Saab          2381
Dodge         2334
Suburu        2290
Chevrolet     2229
Nissan        2184
Volkswagen    2132
Audi          2113
Toyota        2032
Accura        1888
Mercedes      1830
Jeep          1828
Honda         1417
Name: VehicleMake, dtype: int64
-------------------------
Column name: VehicleModel
Data type: category
Value counts:
RAM               1543
A3                1144
Jetta             1142
MDX               1087
Wrangler           994
Passat             982
92x                899
F150               894
A5                 844
Maxima             835
Grand Cherokee     827
Tahoe              825
Neon               824
Forrestor          820
Legacy             813
95                 781
Pathfinder         780
X5                 760
Highlander         744
Escape             743
Civic              728
Fusion             719
ML350              719
Silverado          715
Malibu             709
93                 703
Camry              674
X6                 667
M5                 658
Ultima             654
E400               643
Impreza            591
TL                 576
C300               508
Corolla            503
CRV                484
3 Series           339
RSX                324
Accord             277
Name: VehicleModel, dtype: int64
-------------------------
Column name: DayOfWeek
Data type: category
Value counts:
Friday       4681
Tuesday      4289
Saturday     4273
Sunday       4222
Thursday     4157
Monday       3999
Wednesday    3851
Name: DayOfWeek, dtype: int64
-------------------------
Column name: MonthOfIncident
Data type: category
Value counts:
January     15455
February    13800
March         217
Name: MonthOfIncident, dtype: int64
-------------------------

4.1.11.1 One-Hot Encoding

Train Data & Validation Data

In [149]:
from sklearn.preprocessing import OneHotEncoder
def perform_one_hot_encoding(X_train, X_test, cols, prefix):
    # dense output ('sparse' was renamed 'sparse_output' in newer scikit-learn)
    ohe = OneHotEncoder(drop='first', sparse=False, dtype=np.int32)
    X_train_new = ohe.fit_transform(X_train[cols])
    X_test_new = ohe.transform(X_test[cols])
    # Build labels for the encoded columns: skip the first category of each
    # feature (dropped by drop='first') and add the prefix exactly once.
    label_combined = []
    for categories in ohe.categories_:
        label_combined.extend([prefix + label for label in categories[1:]])
    print("Labels: \n")
    print(label_combined, " \n")
    # Adding transformed X_train back to the main DataFrame
    X_train[label_combined] = X_train_new
    # Adding transformed X_test back to the main DataFrame
    X_test[label_combined] = X_test_new
    print("#### Train Columns Before Dropping ####\n")
    print(X_train.columns, '\n')
    print("#### Test Columns Before Dropping ####\n")
    print(X_test.columns, '\n')
    # Dropping the original encoded columns
    print("#### Dropping Encoded Column in Train ####\n")
    X_train.drop(cols, axis=1, inplace=True)
    print("#### Dropping Encoded Column in Test ####\n")
    X_test.drop(cols, axis=1, inplace=True)
    print("#### Train Columns After Dropping ####\n")
    print(X_train.columns, '\n')
    print("#### Test Columns After Dropping ####\n")
    print(X_test.columns, '\n')
    return ohe
In [150]:
X_train_cat.InsuredGender.value_counts()
Out[150]:
FEMALE    15924
MALE      13548
Name: InsuredGender, dtype: int64
In [151]:
cols = ['InsuredGender']
InsuredGender_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='InsuredGender_')
Labels: 

['InsuredGender_MALE']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE'], dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE'], dtype='object') 

#### Dropping Encoded Column in Train ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE'], dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE'], dtype='object') 

In [152]:
# pickle.dump(InsuredGender_ohe,open('../Pickle Files/InsuredGender_ohe.pkl','wb'))
In [153]:
cols = ['InsurancePolicyState']
InsurancePolicyState_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='InsurancePolicyState_')
Labels: 

['InsurancePolicyState_State2', 'InsurancePolicyState_State3']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3'], dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3'], dtype='object') 

#### Dropping Encoded Column in Train ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3'], dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3'], dtype='object') 

In [154]:
# pickle.dump(InsurancePolicyState_ohe,open('../Pickle Files/InsurancePolicyState_ohe.pkl','wb'))
In [155]:
cols = ['TypeOfIncident']
TypeOfIncident_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='TypeOfIncident_')
Labels: 

['TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft'], dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft'], dtype='object') 

#### Dropping Encoded Column in Train ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft'], dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft'], dtype='object') 

In [156]:
# pickle.dump(TypeOfIncident_ohe,open('../Pickle Files/TypeOfIncident_ohe.pkl','wb'))
In [157]:
cols = ['TypeOfCollission']
TypeOfCollission_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='TypeOfCollission_')
Labels: 

['TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision'], dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision'], dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision'], dtype='object') 

In [158]:
# pickle.dump(TypeOfCollission_ohe,open('../Pickle Files/TypeOfCollission_ohe.pkl','wb'))
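The source of `perform_one_hot_encoding` is defined earlier in the notebook and not shown here; the printed output suggests it learns dummy columns from the train split, applies the same columns to the test split, and then drops the original categorical column from both. A hypothetical re-implementation with `pandas.get_dummies` (unlike the real helper, this sketch returns new frames instead of mutating in place, and returns no fitted encoder object):

```python
import pandas as pd

def perform_one_hot_encoding(train, test, cols, prefix=""):
    """Hypothetical sketch: one-hot encode `cols` with the first level
    dropped, aligning test columns to those learned from train."""
    pre = prefix.rstrip("_")
    train_enc = pd.get_dummies(train, columns=cols, prefix=pre, drop_first=True)
    test_enc = pd.get_dummies(test, columns=cols, prefix=pre, drop_first=True)
    # Categories unseen in train are discarded; categories missing from
    # test are added as all-zero columns so both frames line up.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc

train = pd.DataFrame({"PoliceReport": ["YES", "NO", "YES"]})
test = pd.DataFrame({"PoliceReport": ["NO", "NO", "YES"]})
tr, te = perform_one_hot_encoding(train, test, ["PoliceReport"], prefix="PoliceReport_")
print(list(tr.columns))  # ['PoliceReport_YES']
```

Dropping the first level (as the labels above imply, e.g. only `Rear Collision` and `Side Collision` surviving for `TypeOfCollission`) avoids the redundant column a full dummy expansion would add.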
In [159]:
cols = ['AuthoritiesContacted']
AuthoritiesContacted_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='AuthoritiesContacted_')
Labels: 

['AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police'], dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police'], dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police'], dtype='object') 

In [160]:
# pickle.dump(AuthoritiesContacted_ohe,open('../Pickle Files/AuthoritiesContacted_ohe.pkl','wb'))
In [161]:
cols = ['IncidentState']
IncidentState_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='IncidentState_')
Labels: 

['IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9'], dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9'], dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9'], dtype='object') 

In [162]:
# pickle.dump(IncidentState_ohe,open('../Pickle Files/IncidentState_ohe.pkl','wb'))
In [163]:
cols = ['IncidentCity']
IncidentCity_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='IncidentCity_')
Labels: 

['IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7'], dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7'], dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7'], dtype='object') 

In [164]:
# pickle.dump(IncidentCity_ohe,open('../Pickle Files/IncidentCity_ohe.pkl','wb'))
In [165]:
cols = ['PropertyDamage']
PropertyDamage_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='PropertyDamage_')
Labels: 

['PropertyDamage_YES']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES'], dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES'], dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES'], dtype='object') 

In [166]:
# pickle.dump(PropertyDamage_ohe,open('../Pickle Files/PropertyDamage_ohe.pkl','wb'))
In [167]:
cols = ['PoliceReport']
PoliceReport_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='PoliceReport_')
Labels: 

['PoliceReport_YES']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES'], dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES'], dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES'], dtype='object') 

In [168]:
# pickle.dump(PoliceReport_ohe,open('../Pickle Files/PoliceReport_ohe.pkl','wb'))
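The commented-out `pickle.dump(...)` calls after each cell persist the fitted encoder (e.g. `PoliceReport_ohe`) so the exact same category-to-column mapping can be reloaded at inference time instead of being refit. A minimal sketch of the round trip, using an in-memory buffer and a stand-in dict in place of a real fitted encoder:

```python
import io
import pickle

# Stand-in for a fitted encoder object such as PoliceReport_ohe.
encoder_state = {"columns": ["PoliceReport_YES"], "drop_first": True}

buf = io.BytesIO()                 # swap in open('PoliceReport_ohe.pkl', 'wb') for a file
pickle.dump(encoder_state, buf)    # serialize the fitted state

buf.seek(0)
restored = pickle.load(buf)        # identical state available at inference time
print(restored == encoder_state)   # True
```

Refitting an encoder on serving data could reorder or drop categories, so loading the pickled object is the safer path.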
In [169]:
cols = ['DayOfWeek']
DayOfWeek_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='DayOfWeek_')
Labels: 

['DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday', 'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday'],
      dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday'],
      dtype='object') 

#### Dropping Encoded Column in Train ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday'],
      dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday'],
      dtype='object') 

In [170]:
# pickle.dump(DayOfWeek_ohe,open('../Pickle Files/DayOfWeek_ohe.pkl','wb'))
In [171]:
cols = ['MonthOfIncident']
MonthOfIncident_ohe = perform_one_hot_encoding(X_train_cat,X_test_cat, cols, prefix='MonthOfIncident_')
Labels: 

['MonthOfIncident_January', 'MonthOfIncident_March']  

#### Train Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday', 'MonthOfIncident_January', 'MonthOfIncident_March'],
      dtype='object') 

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday', 'MonthOfIncident_January', 'MonthOfIncident_March'],
      dtype='object') 

#### Dropping Encoded Column in Train ####

#### Dropping Encoded Column in Test ####

#### Train Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday', 'DayOfWeek_Tuesday',
       'DayOfWeek_Wednesday', 'MonthOfIncident_January', 'MonthOfIncident_March'],
      dtype='object') 

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday', 'DayOfWeek_Tuesday',
       'DayOfWeek_Wednesday', 'MonthOfIncident_January', 'MonthOfIncident_March'],
      dtype='object') 

In [172]:
# pickle.dump(MonthOfIncident_ohe,open('../Pickle Files/MonthOfIncident_ohe.pkl','wb'))
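The commented `pickle.dump` calls above persist each fitted encoder so the identical transform can be re-applied later, e.g. on the held-out test data below. A minimal round-trip sketch using a context manager rather than a bare `open()` (the encoder object here is a stand-in for a fitted encoder such as `MonthOfIncident_ohe`; the temp path is illustrative):

```python
import os
import pickle
import tempfile

# Stand-in for the state of a fitted encoder like MonthOfIncident_ohe
fitted_encoder = {"categories_": [["February", "January", "March"]]}

path = os.path.join(tempfile.gettempdir(), "MonthOfIncident_ohe.pkl")

# Persist with a context manager so the file handle is always closed
with open(path, "wb") as f:
    pickle.dump(fitted_encoder, f)

# Reload later (e.g. at scoring time) and recover the same object state
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["categories_"])
```

Reusing the pickled encoder guarantees that train, test, and future scoring data all receive exactly the same dummy columns in the same order.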

Test Data

In [173]:
X_train_cat.head()
Out[173]:
InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 443625 Masters priv-house-serv camping other-relative Minor Damage Location 1851 BMW X6 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
1 472248 JD machine-op-inspct skydiving other-relative Total Loss Location 1159 Ford F150 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0
2 457875 JD sales dancing wife Minor Damage Location 1324 Nissan Maxima 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0
3 476303 JD sales video-games wife Total Loss Location 1073 Saab 95 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1
4 441726 Associate handlers-cleaners golf own-child Major Damage Location 1443 Audi A3 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0
In [174]:
def perform_one_hot_encoding_one_df(X_test, cols, ohe, prefix):
    # Transform the selected columns with the already-fitted encoder
    X_test_new = ohe.transform(X_test[cols])
    # Build the new column labels: drop the first category of each feature
    # and add the prefix once per label (prefixing inside the loop over
    # features would double-prefix earlier labels when len(cols) > 1)
    label_combined = []
    for categories in ohe.categories_:
        label_combined.extend(prefix + label for label in categories[1:])
    print("Labels: \n")
    print(label_combined, " \n")
    # Adding the transformed columns back to the main DataFrame
    X_test[label_combined] = X_test_new
    print("#### Test Columns Before Dropping ####\n")
    print(X_test.columns, '\n')
    # Dropping the original (now encoded) column
    print("#### Dropping Encoded Column in Test ####\n")
    X_test.drop(cols, axis=1, inplace=True)
    print("#### Test Columns After Dropping ####\n")
    print(X_test.columns, '\n')
    return None
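The `categories_[i][1:]` slicing above implies the encoders were fitted with one level dropped per feature: `categories_` still lists every level, but the transformed output omits the first, so slicing off the first name recovers the surviving column labels. A minimal sketch of that convention with scikit-learn's `OneHotEncoder` (the column values and `DayOfWeek_` prefix are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["Monday"], ["Tuesday"], ["Monday"], ["Sunday"]])

# drop='first' removes one redundant level per feature (avoids the
# dummy-variable trap / perfect collinearity among the dummies)
ohe = OneHotEncoder(drop="first")
encoded = ohe.fit_transform(X).toarray()

# categories_ lists all levels in sorted order; slicing [1:] yields the
# names of the columns that actually remain in the transformed output
labels = [f"DayOfWeek_{c}" for c in ohe.categories_[0][1:]]
print(labels)
print(encoded.shape)  # one fewer column than the number of levels
```

Here the sorted levels are Monday, Sunday, Tuesday; Monday is dropped, so the two remaining dummy columns are `DayOfWeek_Sunday` and `DayOfWeek_Tuesday`, matching how `label_combined` is built above.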
In [175]:
cols = ['InsuredGender']
perform_one_hot_encoding_one_df(test_df_cat, cols, InsuredGender_ohe, prefix='InsuredGender_')
Labels: 

['InsuredGender_MALE']  

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE'], dtype='object') 

In [176]:
cols = ['InsurancePolicyState']
perform_one_hot_encoding_one_df(test_df_cat, cols, InsurancePolicyState_ohe, prefix='InsurancePolicyState_')
Labels: 

['InsurancePolicyState_State2', 'InsurancePolicyState_State3']  

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3'], dtype='object') 

In [177]:
cols = ['TypeOfIncident']
perform_one_hot_encoding_one_df(test_df_cat, cols, TypeOfIncident_ohe, prefix='TypeOfIncident_')
Labels: 

['TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft']  

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft'], dtype='object') 

In [178]:
cols = ['TypeOfCollission']
perform_one_hot_encoding_one_df(test_df_cat, cols, TypeOfCollission_ohe, prefix='TypeOfCollission_')
Labels: 

['TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision']  

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision'], dtype='object') 

In [179]:
cols = ['AuthoritiesContacted']
perform_one_hot_encoding_one_df(test_df_cat, cols, AuthoritiesContacted_ohe, prefix='AuthoritiesContacted_')
Labels: 

['AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police']  

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police'], dtype='object') 

In [180]:
cols = ['IncidentState']
perform_one_hot_encoding_one_df(test_df_cat, cols, IncidentState_ohe, prefix='IncidentState_')
Labels: 

['IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9']  

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9'], dtype='object') 

In [181]:
cols = ['IncidentCity']
perform_one_hot_encoding_one_df(test_df_cat, cols, IncidentCity_ohe, prefix='IncidentCity_')
Labels: 

['IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7']  

#### Test Columns Before Dropping ####
Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7'], dtype='object') 

/usr/share/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:4315: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
In [182]:
cols = ['PropertyDamage']
perform_one_hot_encoding_one_df(test_df_cat, cols, PropertyDamage_ohe, prefix='PropertyDamage_')
Labels: 

['PropertyDamage_YES']  

#### Test Columns Before Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES'], dtype='object') 

In [183]:
cols = ['PoliceReport']
perform_one_hot_encoding_one_df(test_df_cat, cols, PoliceReport_ohe, prefix='PoliceReport_')
Labels: 

['PoliceReport_YES']  

#### Test Columns Before Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'PoliceReport', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES'], dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES'], dtype='object') 

In [184]:
cols = ['DayOfWeek']
perform_one_hot_encoding_one_df(test_df_cat, cols, DayOfWeek_ohe, prefix='DayOfWeek_')
Labels: 

['DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday', 'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday']  

#### Test Columns Before Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'DayOfWeek', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday'],
      dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday'],
      dtype='object') 

In [185]:
cols = ['MonthOfIncident']
perform_one_hot_encoding_one_df(test_df_cat, cols, MonthOfIncident_ohe, prefix='MonthOfIncident_')
Labels: 

['MonthOfIncident_January', 'MonthOfIncident_March']  

#### Test Columns Before Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'MonthOfIncident', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday', 'MonthOfIncident_January', 'MonthOfIncident_March'],
      dtype='object') 

#### Dropping Encoded Column in Test ####

#### Test Columns After Dropping ####

Index(['InsuredZipCode', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'InsuredRelationship', 'SeverityOfIncident', 'IncidentAddress', 'VehicleMake', 'VehicleModel', 'InsuredGender_MALE', 'InsurancePolicyState_State2', 'InsurancePolicyState_State3', 'TypeOfIncident_Parked Car', 'TypeOfIncident_Single Vehicle Collision', 'TypeOfIncident_Vehicle Theft', 'TypeOfCollission_Rear Collision', 'TypeOfCollission_Side Collision', 'AuthoritiesContacted_Fire', 'AuthoritiesContacted_None', 'AuthoritiesContacted_Other', 'AuthoritiesContacted_Police', 'IncidentState_State4', 'IncidentState_State5', 'IncidentState_State6', 'IncidentState_State7', 'IncidentState_State8', 'IncidentState_State9', 'IncidentCity_City2', 'IncidentCity_City3', 'IncidentCity_City4', 'IncidentCity_City5', 'IncidentCity_City6', 'IncidentCity_City7', 'PropertyDamage_YES', 'PoliceReport_YES', 'DayOfWeek_Monday', 'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday', 'DayOfWeek_Tuesday',
       'DayOfWeek_Wednesday', 'MonthOfIncident_January', 'MonthOfIncident_March'],
      dtype='object') 

In [186]:
test_df_cat.head()
Out[186]:
InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 471704 High School adm-clerical base-jumping own-child Minor Damage Location 1354 Volkswagen Passat 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 455810 MD prof-specialty golf unmarried Minor Damage Location 1383 Nissan Ultima 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0
2 461919 JD other-service movies other-relative Minor Damage Location 2030 Suburu Impreza 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
3 600904 Masters exec-managerial video-games wife Trivial Damage Location 1449 Accura TL 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0
4 430632 PhD sales board-games own-child Minor Damage Location 1916 Dodge RAM 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
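
The dummy-column pattern above (e.g. `IncidentCity_City2`..`City7` with no `City1`) is what scikit-learn's `OneHotEncoder` produces when the first category is dropped as the all-zeros baseline. A toy sketch of that behaviour, with hypothetical city values rather than the notebook's actual data:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test values for a single categorical column.
train_city = [["City1"], ["City2"], ["City3"], ["City2"]]
test_city = [["City3"], ["City1"]]

# drop="first" makes the first (alphabetical) category the implicit baseline,
# so only City2 and City3 get their own indicator columns.
ohe = OneHotEncoder(drop="first")
ohe.fit(train_city)

out = ohe.transform(test_city).toarray()
# City3 activates the second indicator; City1 (the dropped baseline) is all zeros.
```

Fitting on train and reusing the same encoder on test, as the notebook's `perform_one_hot_encoding_one_df` helper does, guarantees both frames end up with identical columns.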

4.1.11.2 Ordinal Encoding

In [187]:
from sklearn.preprocessing import OrdinalEncoder

def ordinal_encode(X_train, X_test, column_name, categories):
    # Create an instance of the OrdinalEncoder class and fit the categories to it
    encoder = OrdinalEncoder(categories=[categories], dtype=int)
    encoder.fit(X_train[[column_name]])

    # Transform the categorical data in X_train and X_test using the encoder
    X_train_encoded = encoder.transform(X_train[[column_name]])
    X_test_encoded = encoder.transform(X_test[[column_name]])

    # Replace the original categorical column in X_train and X_test with the encoded values
    X_train[column_name] = X_train_encoded
    X_test[column_name] = X_test_encoded
    
    # Return the encoder for later use if needed
    return encoder
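
`OrdinalEncoder` maps each category to its position in the supplied list, so the order of `categories` encodes the rank. A self-contained sketch using the same education-level ordering as the cells below:

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit ordering: position in the list becomes the integer code.
categories = ["High School", "Associate", "College", "Masters", "JD", "MD", "PhD"]
enc = OrdinalEncoder(categories=[categories], dtype=int)
enc.fit([[c] for c in categories])

codes = enc.transform([["High School"], ["Masters"], ["PhD"]])
# 'High School' -> 0, 'Masters' -> 3, 'PhD' -> 6, following the given order
```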
In [188]:
categories = ['High School', 'Associate', 'College', 'Masters', 'JD', 'MD', 'PhD']
InsuredEducationLevel_encoder = ordinal_encode(X_train_cat, X_test_cat, 'InsuredEducationLevel', categories)
/usr/share/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]
/usr/share/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:14: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
In [189]:
# pickle.dump(InsuredEducationLevel_encoder,open('../Pickle Files/InsuredEducationLevel_encoder.pkl','wb'))
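
The commented-out `pickle.dump` cells persist each fitted encoder so the same mapping can be reapplied at scoring time. A minimal round-trip sketch of that idea, in memory via `dumps`/`loads` rather than the notebook's file paths:

```python
import pickle
from sklearn.preprocessing import OrdinalEncoder

# Fit a small encoder (severity ordering taken from the notebook's categories).
enc = OrdinalEncoder(
    categories=[["Trivial Damage", "Minor Damage", "Major Damage", "Total Loss"]],
    dtype=int,
)
enc.fit([["Trivial Damage"], ["Total Loss"]])

# Serialize and restore; the restored encoder reproduces the same codes.
blob = pickle.dumps(enc)
restored = pickle.loads(blob)
code = restored.transform([["Major Damage"]])
```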
In [190]:
categories = ['not-in-family', 'other-relative', 'own-child', 'unmarried', 'wife', 'husband']
InsuredRelationship_encoder = ordinal_encode(X_train_cat, X_test_cat, 'InsuredRelationship', categories)
In [191]:
# pickle.dump(InsuredRelationship_encoder,open('../Pickle Files/InsuredRelationship_encoder.pkl','wb'))
In [192]:
categories = ['Trivial Damage', 'Minor Damage', 'Major Damage', 'Total Loss']
SeverityOfIncident_encoder = ordinal_encode(X_train_cat, X_test_cat, 'SeverityOfIncident', categories)
In [193]:
# pickle.dump(SeverityOfIncident_encoder,open('../Pickle Files/SeverityOfIncident_encoder.pkl','wb'))
In [194]:
X_train_cat
Out[194]:
InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 443625 3 priv-house-serv camping 1 1 Location 1851 BMW X6 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
1 472248 4 machine-op-inspct skydiving 1 3 Location 1159 Ford F150 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0
2 457875 4 sales dancing 4 1 Location 1324 Nissan Maxima 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0
3 476303 4 sales video-games 4 3 Location 1073 Saab 95 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1
4 441726 1 handlers-cleaners golf 2 2 Location 1443 Audi A3 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
29467 451312 4 handlers-cleaners golf 5 2 Location 1681 Volkswagen Ultima 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0
29468 434923 1 tech-support board-games 1 2 Location 1526 Chevrolet RAM 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
29469 463835 2 prof-specialty base-jumping 4 2 Location 1066 Nissan MDX 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0
29470 611996 0 farming-fishing video-games 0 2 Location 1014 Audi A5 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0
29471 433593 5 armed-forces polo 2 1 Location 1963 Audi A5 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0

29472 rows × 43 columns

Test Data

In [195]:
def ordinal_encoder_one_df(X_test, column_name, encoder):
    # Transform the categorical data in X_test using the encoder
    X_test_encoded = encoder.transform(X_test[[column_name]])

    # Replace the original categorical column in X_test with the encoded values
    X_test[column_name] = X_test_encoded
    
    return None
In [196]:
ordinal_encoder_one_df(test_df_cat, 'InsuredEducationLevel', InsuredEducationLevel_encoder)
/usr/share/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
In [197]:
ordinal_encoder_one_df(test_df_cat, 'InsuredRelationship', InsuredRelationship_encoder)
In [198]:
ordinal_encoder_one_df(test_df_cat, 'SeverityOfIncident', SeverityOfIncident_encoder)
In [199]:
test_df_cat
Out[199]:
InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 471704 0 adm-clerical base-jumping 2 1 Location 1354 Volkswagen Passat 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 455810 5 prof-specialty golf 3 1 Location 1383 Nissan Ultima 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0
2 461919 4 other-service movies 1 1 Location 2030 Suburu Impreza 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
3 600904 3 exec-managerial video-games 4 0 Location 1449 Accura TL 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0
4 430632 6 sales board-games 2 1 Location 1916 Dodge RAM 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8907 446435 1 tech-support camping 4 3 Location 1958 Saab 95 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0
8908 438237 1 craft-repair movies 5 3 Location 1035 Saab 92x 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
8909 450339 0 armed-forces dancing 4 1 Location 2037 BMW Civic 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
8910 439304 6 transport-moving hiking 4 1 Location 2097 Jeep Grand Cherokee 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0
8911 450730 6 handlers-cleaners video-games 5 3 Location 1024 Suburu E400 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0

8912 rows × 43 columns

4.1.11.3 Label Encoding

In [200]:
X_train_cat.dtypes
Out[200]:
InsuredZipCode                             category
InsuredEducationLevel                         int64
InsuredOccupation                          category
InsuredHobbies                             category
InsuredRelationship                           int64
SeverityOfIncident                            int64
IncidentAddress                            category
VehicleMake                                category
VehicleModel                               category
InsuredGender_MALE                            int64
InsurancePolicyState_State2                   int64
InsurancePolicyState_State3                   int64
TypeOfIncident_Parked Car                     int64
TypeOfIncident_Single Vehicle Collision       int64
TypeOfIncident_Vehicle Theft                  int64
TypeOfCollission_Rear Collision               int64
TypeOfCollission_Side Collision               int64
AuthoritiesContacted_Fire                     int64
AuthoritiesContacted_None                     int64
AuthoritiesContacted_Other                    int64
AuthoritiesContacted_Police                   int64
IncidentState_State4                          int64
IncidentState_State5                          int64
IncidentState_State6                          int64
IncidentState_State7                          int64
IncidentState_State8                          int64
IncidentState_State9                          int64
IncidentCity_City2                            int64
IncidentCity_City3                            int64
IncidentCity_City4                            int64
IncidentCity_City5                            int64
IncidentCity_City6                            int64
IncidentCity_City7                            int64
PropertyDamage_YES                            int64
PoliceReport_YES                              int64
DayOfWeek_Monday                              int64
DayOfWeek_Saturday                            int64
DayOfWeek_Sunday                              int64
DayOfWeek_Thursday                            int64
DayOfWeek_Tuesday                             int64
DayOfWeek_Wednesday                           int64
MonthOfIncident_January                       int64
MonthOfIncident_March                         int64
dtype: object
In [201]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
In [202]:
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
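
`LabelEncoder` assigns integer codes to the sorted distinct labels, so a Y/N target becomes 0/1. A toy sketch (the labels here are illustrative; the notebook's `y_train`/`y_test` hold the actual fraud flag):

```python
from sklearn.preprocessing import LabelEncoder

y = ["N", "Y", "Y", "N"]
le = LabelEncoder()
le.fit(y)  # classes_ are sorted, so 'N' -> 0 and 'Y' -> 1

encoded = le.transform(y)
decoded = le.inverse_transform(encoded)  # recovers the original labels
```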

4.1.11.4 Target Encoding

Train Data

In [203]:
# !pip install category_encoders
In [204]:
import category_encoders as ce

target_encoder = ce.TargetEncoder(cols=['InsuredZipCode', 'InsuredOccupation', 'InsuredHobbies', 'IncidentAddress', 'VehicleMake', 'VehicleModel'])

target_encoder.fit(X_train_cat, y_train)

# Transform the training and validation data
X_train_cat_encoded = target_encoder.transform(X_train_cat)
X_test_cat_encoded = target_encoder.transform(X_test_cat)
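
`TargetEncoder` replaces each high-cardinality category with a smoothed mean of the target for that category, learned on train only. The unsmoothed core idea in plain pandas, with toy data (the real encoder additionally blends each category mean toward the global mean to tame rare categories):

```python
import pandas as pd

train = pd.DataFrame({
    "VehicleMake": ["Audi", "BMW", "Audi", "Ford", "BMW", "Audi"],
    "Fraud":       [1,      0,     0,      1,      0,     1],
})

# Mean target per category, computed on the training frame only.
means = train.groupby("VehicleMake")["Fraud"].mean()

# Apply the same learned mapping to train and to a toy test column; a category
# unseen in train would map to NaN and need a global-mean fallback in practice.
train["VehicleMake_te"] = train["VehicleMake"].map(means)
test_col = pd.Series(["Ford", "BMW"]).map(means)
```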
In [205]:
X_train_cat_encoded.head()
Out[205]:
InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 0.348138 3 0.433398 0.319322 1 1 0.161291 0.588454 0.743628 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
1 0.250075 4 0.446020 0.428228 1 3 0.250547 0.580578 0.597315 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0
2 0.344987 4 0.505838 0.221753 4 1 0.154709 0.414835 0.610778 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0
3 0.761433 4 0.505838 0.481894 4 3 0.841161 0.464091 0.450704 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1
4 0.237199 1 0.443820 0.302474 2 2 0.392423 0.569333 0.542832 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0
In [206]:
# pickle.dump(target_encoder,open('../Pickle Files/target_encoder.pkl','wb'))

Test Data

In [207]:
test_df_cat_encoded = target_encoder.transform(test_df_cat)
In [208]:
test_df_cat_encoded.head()
Out[208]:
InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 0.275492 0 0.348341 0.513372 2 1 0.275000 0.560507 0.568228 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 0.385142 5 0.500616 0.302474 3 1 0.402879 0.414835 0.354740 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0
2 0.299926 4 0.385301 0.421384 1 1 0.312507 0.487773 0.556684 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
3 0.365529 3 0.653998 0.481894 4 0 0.434061 0.371822 0.322917 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0
4 0.336399 6 0.505838 0.553016 2 1 0.299344 0.518852 0.594945 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
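Under the hood, target encoding replaces each category with the mean of the target for that category, computed on the training split only (which is why the fitted encoder is reused on the validation and test frames above). A minimal unsmoothed sketch on a hypothetical toy column:

```python
import pandas as pd

# Hypothetical toy data; real columns like InsuredHobbies work the same way
train = pd.DataFrame({"hobby": ["chess", "golf", "chess", "golf", "yoga"],
                      "fraud": [1, 0, 1, 1, 0]})

# Per-category target mean, computed on TRAIN data only
means = train.groupby("hobby")["fraud"].mean()
train["hobby_enc"] = train["hobby"].map(means)

print(means.to_dict())  # {'chess': 1.0, 'golf': 0.5, 'yoga': 0.0}
```

Note that `category_encoders.TargetEncoder` additionally blends each category mean with the global mean (smoothing), so its encoded values differ slightly from these raw means, especially for rare categories.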

4.1.12 Resetting the Index of Categorical & Numerical DataFrames

4.1.12.1 Train Data

In [209]:
X_train_num.reset_index(inplace = True, drop = True)
X_train_num.shape
Out[209]:
(29472, 19)
In [210]:
X_train_cat_encoded.reset_index(inplace = True, drop = True)
X_train_cat_encoded.shape
Out[210]:
(29472, 43)

4.1.12.2 Validation Data

In [211]:
X_test_num.reset_index(inplace = True, drop = True)
X_test_num.shape
Out[211]:
(8651, 19)
In [212]:
X_test_cat_encoded.reset_index(inplace = True, drop = True)
X_test_cat_encoded.shape
Out[212]:
(8651, 43)

4.1.12.3 Test Data

In [213]:
test_df_num.reset_index(inplace = True, drop = True)
test_df_num.shape
Out[213]:
(8912, 19)
In [214]:
test_df_cat_encoded.reset_index(inplace = True, drop = True)
test_df_cat_encoded.shape
Out[214]:
(8912, 43)
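Resetting the indices before combining matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position; after a shuffled train/test split the numerical and categorical frames can carry disjoint leftover indices, and concatenating them directly would produce NaN-padded rows. A small demonstration of the pitfall:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
b = pd.DataFrame({"y": [3, 4]}, index=[5, 6])  # e.g. leftover split indices

misaligned = pd.concat([a, b], axis=1)          # union of indices: 4 NaN-padded rows
aligned = pd.concat([a.reset_index(drop=True),
                     b.reset_index(drop=True)], axis=1)

print(misaligned.shape)  # (4, 2)
print(aligned.shape)     # (2, 2)
```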

4.1.13 Combining Categorical & Numerical DataFrames

4.1.13.1 Train Data

In [215]:
# Combine numerical data and categorical data
def combine_num_df_cat_df(num_df, cat_df):
    result = pd.concat([num_df, cat_df], axis=1) # Using the concat function in pandas, join the numerical and categorical columns
    display(result.head()) # Printing the head of combined dataset
    return result
In [216]:
X_train = combine_num_df_cat_df(X_train_num, X_train_cat_encoded)
InsuredAge CapitalGains CapitalLoss CustomerLoyaltyPeriod Policy_Deductible PolicyAnnualPremium UmbrellaLimit IncidentTime NumberOfVehicles BodilyInjuries Witnesses AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage SplitLimit CombinedSingleLimit VehicleAge TimeBetweenCoverageAndIncident InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 -0.371067 -0.834117 0.904630 -0.103752 -0.214596 0.914218 1.182516 0.869989 0.160479 0.010272 -1.526324 0.794385 0.382263 0.365752 0.920778 -1.137881 -1.025418 1.564428 0.871229 0.348138 3 0.433398 0.319322 1 1 0.161291 0.588454 0.743628 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
1 -1.399944 -0.834117 0.904630 -0.804736 -1.136054 1.267395 -0.513538 -1.488634 2.254084 -1.313987 -0.517114 0.571910 0.899916 0.881911 0.351986 -1.137881 -0.294639 -1.165845 1.564490 0.250075 4 0.446020 0.428228 1 3 0.250547 0.580578 0.597315 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0
2 0.915030 0.850680 0.904630 0.751860 1.228015 -0.818553 -0.513538 -0.140850 -0.886324 0.010272 0.492096 -0.712174 -0.791913 -0.805035 -0.594164 -0.158391 -0.294639 -0.747609 -1.402459 0.344987 4 0.505838 0.221753 4 1 0.154709 0.414835 0.610778 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0
3 -0.113847 -0.834117 -1.278445 -0.155295 1.579226 -0.231393 -0.513538 1.880827 1.207282 1.334531 0.492096 0.597106 0.933518 -0.432970 0.707987 -0.158391 -1.025418 1.203806 -0.649139 0.761433 4 0.505838 0.481894 4 3 0.841161 0.464091 0.450704 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1
4 -1.271335 -0.834117 -1.492046 -0.825353 -1.139831 0.247915 -0.513538 0.364570 1.207282 -1.313987 1.501305 0.727125 -0.085809 -0.100969 1.060703 1.474093 1.532310 0.032317 1.320390 0.237199 1 0.443820 0.302474 2 2 0.392423 0.569333 0.542832 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0
In [217]:
print(X_train.shape)
(29472, 62)

4.1.13.2 Validation Data

In [218]:
X_test = combine_num_df_cat_df(X_test_num, X_test_cat_encoded)
InsuredAge CapitalGains CapitalLoss CustomerLoyaltyPeriod Policy_Deductible PolicyAnnualPremium UmbrellaLimit IncidentTime NumberOfVehicles BodilyInjuries Witnesses AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage SplitLimit CombinedSingleLimit VehicleAge TimeBetweenCoverageAndIncident InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 0.271982 -0.834117 0.904630 0.411677 -1.158713 -0.780476 -0.513538 -0.646269 -0.886324 0.010272 0.492096 -1.994660 -1.496371 -1.644525 -1.996815 -0.158391 -0.294639 -0.574233 -0.143780 0.355902 4 0.534717 0.435523 1 1 0.313262 0.371822 0.297436 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0
1 -1.657164 -0.834117 0.904630 -1.072759 -0.214596 1.730145 -0.513538 -0.814742 1.207282 0.010272 -0.517114 0.748872 1.501455 -0.251389 0.732393 -0.158391 -0.294639 1.392652 -1.301645 0.287491 5 0.433398 0.435523 5 3 0.153326 0.371822 0.465501 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
2 -0.499676 0.331718 -0.710049 -0.763501 -1.158713 -0.688574 -0.513538 1.206935 2.254084 0.010272 0.492096 0.167087 -0.002863 -0.018262 0.238168 -0.158391 -0.294639 -0.575300 1.057843 0.999944 0 0.653998 0.309489 2 1 0.993242 0.560507 0.561296 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0
3 0.143372 0.993211 -1.155353 -0.134678 -0.348661 -1.301203 -0.513538 -0.140850 -0.886324 -1.313987 0.492096 0.785342 0.100292 1.512876 0.691325 1.474093 1.532310 -0.362447 -1.383154 0.311230 3 0.501451 0.428228 5 3 0.437418 0.500224 0.542832 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0
4 -0.242457 1.077269 0.904630 -0.526404 -0.988772 -1.344040 -0.513538 -1.320161 1.207282 -1.313987 1.501305 -0.457688 -0.611686 -0.625328 -0.329216 -0.158391 -0.294639 1.180333 -0.272050 0.349508 0 0.563472 0.349945 0 1 0.349945 0.414835 0.561296 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
In [219]:
print(X_test.shape)
(8651, 62)

4.1.13.3 Test Data

In [222]:
test_df_final = combine_num_df_cat_df(test_df_num, test_df_cat_encoded)
InsuredAge CapitalGains CapitalLoss CustomerLoyaltyPeriod Policy_Deductible PolicyAnnualPremium UmbrellaLimit IncidentTime NumberOfVehicles BodilyInjuries Witnesses AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage SplitLimit CombinedSingleLimit VehicleAge TimeBetweenCoverageAndIncident InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 -1.528554 1.227110 -1.158973 -1.237696 1.673638 -1.285152 -0.513538 -1.320161 1.207282 -1.313987 -1.526324 0.585580 -0.164996 0.106853 0.831015 1.474093 1.532310 1.970395 0.585517 0.275492 0 0.348341 0.513372 2 1 0.275000 0.560507 0.568228 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 0.143372 1.238074 -1.470324 0.287974 -1.158713 0.098037 -0.513538 0.701516 -0.886324 0.010272 -0.517114 0.035217 0.135303 -0.553633 0.153803 -1.137881 -1.025418 -0.185337 -1.540167 0.385142 5 0.500616 0.302474 3 1 0.402879 0.414835 0.354740 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0
2 0.014762 0.276899 0.904630 0.143654 -0.214596 0.986676 0.104372 1.375408 1.207282 -1.313987 -0.517114 0.213778 -0.011087 0.353100 0.212354 -0.158391 -0.294639 -0.968997 -1.337251 0.299926 4 0.385301 0.421384 1 1 0.312507 0.487773 0.556684 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
3 -0.113847 1.669323 0.904630 0.009642 1.673638 0.141532 2.422672 -0.309323 -0.886324 1.334531 -0.517114 -2.081816 -1.654980 -1.578453 -2.095320 1.474093 -0.294639 -0.775349 0.204996 0.365529 3 0.653998 0.481894 4 0 0.434061 0.371822 0.322917 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0
4 -1.271335 0.448667 0.904630 -0.722267 1.673638 0.776390 1.986125 -0.814742 1.207282 1.334531 1.501305 -0.832141 0.038258 -0.879307 -0.950166 -1.137881 -1.025418 -0.358713 0.228161 0.336399 6 0.505838 0.553016 2 1 0.299344 0.518852 0.594945 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
In [223]:
test_df_final.shape
Out[223]:
(8912, 62)

5. Model Building

5.1 LazyPredict Classifier

In [224]:
!pip install lazypredict
In [225]:
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
# fit returns two DataFrames: a leaderboard of models and per-model predictions
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

# Display the leaderboard
print(models)
 90%|████████▉ | 26/29 [04:12<00:44, 14.84s/it]
[19:53:44] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
100%|██████████| 29/29 [04:14<00:00,  8.77s/it]
                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
Model                                                                                    
LinearDiscriminantAnalysis         0.94               0.91     0.91      0.94        0.42
RidgeClassifierCV                  0.94               0.91     0.91      0.94        0.28
RidgeClassifier                    0.94               0.91     0.91      0.94        0.18
LinearSVC                          0.94               0.91     0.91      0.94       10.25
CalibratedClassifierCV             0.94               0.91     0.91      0.94       23.09
LGBMClassifier                     0.94               0.91     0.91      0.94        0.50
LogisticRegression                 0.94               0.91     0.91      0.94        0.22
SVC                                0.94               0.91     0.91      0.94       42.00
BernoulliNB                        0.94               0.91     0.91      0.94        0.14
AdaBoostClassifier                 0.94               0.91     0.91      0.94        3.25
RandomForestClassifier             0.94               0.91     0.91      0.94        6.22
SGDClassifier                      0.94               0.91     0.91      0.94        0.28
XGBClassifier                      0.94               0.91     0.91      0.94        1.79
ExtraTreesClassifier               0.94               0.91     0.91      0.94        3.22
NuSVC                              0.94               0.91     0.91      0.94       81.19
KNeighborsClassifier               0.94               0.91     0.91      0.93        5.16
GaussianNB                         0.93               0.90     0.90      0.93        0.11
BaggingClassifier                  0.93               0.90     0.90      0.93        5.21
QuadraticDiscriminantAnalysis      0.92               0.90     0.90      0.92        0.18
NearestCentroid                    0.92               0.89     0.89      0.92        0.12
Perceptron                         0.89               0.87     0.87      0.89        0.14
DecisionTreeClassifier             0.87               0.85     0.85      0.87        0.86
PassiveAggressiveClassifier        0.85               0.85     0.85      0.86        0.14
ExtraTreeClassifier                0.87               0.85     0.85      0.87        0.11
LabelSpreading                     0.88               0.85     0.85      0.88       39.64
LabelPropagation                   0.88               0.85     0.85      0.88       29.47
DummyClassifier                    0.73               0.50     0.50      0.62        0.08
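Since the leaderboard is a regular pandas DataFrame indexed by model name, it can be sliced like any other frame, e.g. to shortlist models that are both accurate and fast. A sketch on a stand-in frame (rerunning all 29 models just to filter would be slow; the values below are a hypothetical subset of the table above):

```python
import pandas as pd

# Stand-in for the `models` leaderboard (hypothetical subset of the results)
models = pd.DataFrame(
    {"F1 Score": [0.94, 0.87, 0.62], "Time Taken": [0.50, 0.86, 0.08]},
    index=["LGBMClassifier", "DecisionTreeClassifier", "DummyClassifier"],
)

# Keep models with strong F1 that also trained in under a second
fast_and_good = models[(models["F1 Score"] >= 0.90) & (models["Time Taken"] < 1.0)]
print(list(fast_and_good.index))  # ['LGBMClassifier']
```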

5.2 Logistic Regression

In [226]:
from sklearn import metrics

def fit_and_evaluate_classification(model_name, model, X_train, X_test, y_train, y_test):
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the train set
    y_train_pred = model.predict(X_train)
    
    # Predict on the test set
    y_test_pred = model.predict(X_test)
    
    # Evaluate the model using various metrics
    # (ROC AUC below is computed from hard class predictions; passing
    # predict_proba scores instead would give the threshold-independent AUC)
    accuracy_train = metrics.accuracy_score(y_train, y_train_pred)
    precision_train = metrics.precision_score(y_train, y_train_pred)
    recall_train = metrics.recall_score(y_train, y_train_pred)
    f1_score_train = metrics.f1_score(y_train, y_train_pred)
    roc_auc_train = metrics.roc_auc_score(y_train, y_train_pred)
    
    accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
    precision_test = metrics.precision_score(y_test, y_test_pred)
    recall_test = metrics.recall_score(y_test, y_test_pred)
    f1_score_test = metrics.f1_score(y_test, y_test_pred)
    roc_auc_test = metrics.roc_auc_score(y_test, y_test_pred)
    
    # Return the results
    return model_name, model, accuracy_train, precision_train, recall_train, f1_score_train, roc_auc_train, accuracy_test, precision_test, recall_test, f1_score_test, roc_auc_test

results_df = pd.DataFrame()
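The same `new_row` construction is repeated verbatim for every model in the cells below; a small helper (a sketch, not part of the original notebook) would keep that pattern in one place:

```python
import pandas as pd

METRIC_COLS = ['Model Name',
               'Accuracy (Train)', 'Precision (Train)', 'Recall (Train)',
               'F1 Score (Train)', 'ROC AUC Score (Train)',
               'Accuracy (Test)', 'Precision (Test)', 'Recall (Test)',
               'F1 Score (Test)', 'ROC AUC Score (Test)']

def append_result(results_df, result):
    # result[0] is the model name, result[1] the fitted model,
    # result[2:] the ten metrics in the order returned by
    # fit_and_evaluate_classification above
    row = pd.DataFrame([[result[0], *result[2:]]], columns=METRIC_COLS)
    return pd.concat([results_df, row], ignore_index=True)
```

Usage would then be a single line per model: `results_df = append_result(results_df, result)`.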
In [227]:
print(X_train.shape)
(29472, 62)
In [228]:
print(X_test.shape)
(8651, 62)
In [229]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
result = fit_and_evaluate_classification("LR", lr, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
In [230]:
# from sklearn.model_selection import GridSearchCV

# # define the logistic regression model
# logreg = LogisticRegression()

# # define the grid of hyperparameters to search over
# param_grid = {'C': [0.1, 1, 10, 100],
#               'penalty': ['l1', 'l2', 'elasticnet', 'none'],
#               'max_iter': [100, 500, 1000]}

# # perform grid search using 5-fold cross validation
# grid = GridSearchCV(logreg, param_grid, cv=5, scoring='f1', n_jobs=-1)

# # Fit the GridSearchCV object to the data
# grid.fit(X_train, y_train)

# # Print the best hyperparameters and the corresponding F1 score
# print("Best hyperparameters: ", grid.best_params_)
# print("Best F1 score: ", grid.best_score_)

# Best hyperparameters:  {'C': 0.1, 'max_iter': 100, 'penalty': 'none'}
# Best F1 score:  0.9431984482248943
In [231]:
from sklearn.linear_model import LogisticRegression
# Note: the grid search above preferred penalty='none'; 'l2' with C=0.1 is
# used here instead, which keeps some regularisation
lr_gscv = LogisticRegression(C=0.1, max_iter=500, penalty='l2')
result = fit_and_evaluate_classification("Logistic Regression Grid Search", lr_gscv, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91

5.3 Decision Tree Classifier

In [232]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier() 
# Define the models to evaluate
result = fit_and_evaluate_classification("Decision Tree", dt, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Seach 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
In [233]:
# from sklearn.model_selection import GridSearchCV

# from sklearn.metrics import f1_score, make_scorer

# dtc = DecisionTreeClassifier()

# # define the grid of hyperparameters to search over
# param_grid = {'criterion': ['gini', 'entropy'],
#               'max_depth': [2, 4, 6, 8, 10]}

# # define the f1 score as the evaluation metric
# scorer = make_scorer(f1_score, average='micro')

# # perform grid search using 5-fold cross validation and f1 score as the evaluation metric
# grid = GridSearchCV(dtc, param_grid, cv=5, scoring=scorer)

# # fit the grid search to the data
# grid.fit(X_train, y_train)

# # print the best hyperparameters and the corresponding score
# print("Best hyperparameters: ", grid.best_params_)
# print("Best score: ", grid.best_score_)

# Best hyperparameters:  {'criterion': 'gini', 'max_depth': 6}
# Best score:  0.946697301175733
In [234]:
from sklearn.tree import DecisionTreeClassifier
dt_gscv = DecisionTreeClassifier(criterion='gini', max_depth=6) 
# Define the models to evaluate
result = fit_and_evaluate_classification("Decision Tree Grid Search", dt_gscv, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
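The unconstrained tree above memorises the training set (1.00 train vs 0.87 test F1), while capping `max_depth` at 6 closes most of that gap. The effect is easy to reproduce on synthetic data (a sketch using `make_classification`, not the claims data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gaps = {}
for depth in (None, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # train/test accuracy gap as a rough overfitting measure
    gaps[depth] = tree.score(X_tr, y_tr) - tree.score(X_te, y_te)

print(gaps)  # the depth-capped tree should show the smaller gap
```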

5.4 Random Forest Classifier

In [235]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier() 
# Define the models to evaluate
result = fit_and_evaluate_classification("Random Forest Classifier", rf, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
In [236]:
# from sklearn.model_selection import GridSearchCV

# from sklearn.metrics import f1_score, make_scorer

# rfc = RandomForestClassifier()

# # define the grid of hyperparameters to search over
# param_grid = {'n_estimators': [50, 100, 150],
#               'max_depth': [2, 4, 6, 8],
#               'min_samples_split': [2, 4, 6]}

# # define the f1 score as the evaluation metric
# scorer = make_scorer(f1_score, average='micro')

# # perform grid search using 5-fold cross validation and f1 score as the evaluation metric
# grid = GridSearchCV(rfc, param_grid, cv=5, scoring=scorer, n_jobs=-1, verbose=1)

# # fit the grid search to the data
# grid.fit(X_train, y_train)

# # print the best hyperparameters and the corresponding score
# print("Best hyperparameters: ", grid.best_params_)
# print("Best score: ", grid.best_score_)

# Best hyperparameters:  {'max_depth': 8, 'min_samples_split': 2, 'n_estimators': 100}
# Best score:  0.9482242489810803
In [237]:
from sklearn.ensemble import RandomForestClassifier

rf_gscv = RandomForestClassifier(max_depth=8, min_samples_split=2, n_estimators=100) 
# Evaluate the tuned random forest
result = fit_and_evaluate_classification("Random Forest Classifier Grid Search", rf_gscv, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
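Beyond aggregate metrics, a fitted random forest exposes `feature_importances_`, which can rank the predictors driving the fraud signal. A minimal sketch on synthetic stand-in data (the feature names and dataset here are illustrative, not columns from the claims data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the claims data
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
names = [f"feat_{i}" for i in range(X.shape[1])]

rf_demo = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; sort descending to rank the predictors
ranked = sorted(zip(names, rf_demo.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked[:3]:
    print(f"{name}: {imp:.3f}")
```

On the real model, the same pattern applied to `rf_gscv` and the claim feature names would surface which attributes the forest leans on most.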
In [240]:
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(model, X_test, y_test):
    
    # Make predictions on the testing data
    y_pred = model.predict(X_test)
    
    # Compute the confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    # Plot the confusion matrix as a heatmap
    sns.heatmap(cm, annot=True, cmap='Blues', fmt='g', 
                xticklabels=['Negative', 'Positive'], 
                yticklabels=['Negative', 'Positive'])
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    
plot_confusion_matrix(rf_gscv, X_test, y_test)
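The heatmap shows raw counts; the same four cells can also be unpacked numerically with `ravel()`, which is handy when recall on the fraud class is the metric of interest. A small sketch on made-up labels (not the notebook's actual test set):

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only -- not the notebook's actual y_test / predictions
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# ravel() unpacks the 2x2 matrix row by row: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # → TN=3 FP=1 FN=1 TP=3

# Recall (fraud catch rate) = TP / (TP + FN)
recall = tp / (tp + fn)
```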

5.5 K-Nearest Neighbour

In [244]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with a chosen value of K
knn = KNeighborsClassifier(n_neighbors=5)

result = fit_and_evaluate_classification("K-Nearest Neighbour Classifier", knn, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
In [245]:
# from sklearn.model_selection import GridSearchCV

# from sklearn.metrics import f1_score, make_scorer

# # Define the KNN classifier
# knn = KNeighborsClassifier()

# # Define the parameter grid to search over
# param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}

# # define the f1 score as the evaluation metric
# scorer = make_scorer(f1_score, average='micro')

# # perform grid search using 5-fold cross validation and f1 score as the evaluation metric
# grid = GridSearchCV(knn, param_grid, cv=5, scoring=scorer, n_jobs=-1, verbose=1)

# # fit the grid search to the data
# grid.fit(X_train, y_train)

# # print the best hyperparameters and the corresponding score
# print("Best hyperparameters: ", grid.best_params_)
# print("Best score: ", grid.best_score_)

# Best hyperparameters:  {'n_neighbors': 5, 'weights': 'distance'}
# Best score:  0.9405219609194152
In [246]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with a chosen value of K
knn_gscv = KNeighborsClassifier(n_neighbors=5, weights='distance')

result = fit_and_evaluate_classification("K-Nearest Neighbour Classifier Grid Search", knn_gscv, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
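The perfect training scores in row 7 are expected rather than alarming: with `weights='distance'`, every training point is its own zero-distance nearest neighbour, so training predictions simply reproduce the training labels. A quick sketch on synthetic stand-in data (not the claims dataset) illustrates this:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data -- not the notebook's claims dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

knn_d = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X, y)
# Each training point is its own zero-distance neighbour, so distance
# weighting reproduces the training labels exactly.
print(knn_d.score(X, y))  # → 1.0
```

The honest comparison is therefore the test-set row, where the distance-weighted model performs on par with uniform weighting.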

5.6 Support Vector Classifier

In [247]:
from sklearn.svm import SVC
svc = SVC()
# Evaluate the support vector classifier with default hyperparameters
result = fit_and_evaluate_classification("Support Vector Classifier", svc, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
8 Support Vector Classifier 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
In [248]:
# from sklearn.model_selection import GridSearchCV

# from sklearn.metrics import f1_score, make_scorer

# # Define the SVC classifier
# svc = SVC()

# # Define the parameter grid to search over
# param_grid = {'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# # define the f1 score as the evaluation metric
# scorer = make_scorer(f1_score, average='micro')

# # perform grid search using 5-fold cross validation and f1 score as the evaluation metric
# grid = GridSearchCV(svc, param_grid, cv=5, scoring=scorer, n_jobs=-1, verbose=1)

# # fit the grid search to the data
# grid.fit(X_train, y_train)

# # print the best hyperparameters and the corresponding score
# print("Best hyperparameters: ", grid.best_params_)
# print("Best score: ", grid.best_score_)

# Best hyperparameters:  {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
# Best score:  0.9509729334729787
In [249]:
from sklearn.svm import SVC
svc_gscv = SVC(C=1, gamma=0.1, kernel='rbf')
# Evaluate the tuned support vector classifier
result = fit_and_evaluate_classification("Support Vector Classifier Grid Search", svc_gscv, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
8 Support Vector Classifier 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
9 Support Vector Classifier Grid Search 0.97 0.98 0.95 0.97 0.97 0.94 0.94 0.84 0.89 0.91
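The tuned `gamma=0.1` reflects how gamma shapes the RBF boundary: small gamma gives wide kernels and a smooth, potentially underfit boundary, while large gamma gives narrow kernels that can memorise the training set. A minimal sketch on synthetic stand-in data (the values here are illustrative, not the notebook's claims features):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Larger gamma -> tighter RBF kernels -> more flexible boundary on the
# training data (and a growing risk of overfitting)
scores = []
for gamma in (0.01, 0.1, 10):
    svc_demo = SVC(C=1, gamma=gamma, kernel='rbf').fit(X, y)
    scores.append(svc_demo.score(X, y))
    print(gamma, round(scores[-1], 2))
```

Grid search balances this flexibility against cross-validated performance rather than training fit.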

5.7 XGBoost

In [250]:
# Import XGBoost
import xgboost as xgb

# Define the XGBoost model with default hyperparameters
xgb_classifier = xgb.XGBClassifier()

result = fit_and_evaluate_classification("XGBoost Classifier", xgb_classifier, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
[20:04:36] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
8 Support Vector Classifier 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
9 Support Vector Classifier Grid Search 0.97 0.98 0.95 0.97 0.97 0.94 0.94 0.84 0.89 0.91
10 XGBoost Classifier 0.97 0.99 0.96 0.97 0.97 0.94 0.93 0.85 0.89 0.91
In [251]:
# from sklearn.model_selection import GridSearchCV

# from sklearn.metrics import f1_score, make_scorer

# # Import XGBoost
# import xgboost as xgb

# # Define the XGBoost model
# xgb_classifier_gscv = xgb.XGBClassifier()

# # Define the parameter grid to search over
# param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 7], 'learning_rate': [0.01, 0.1, 1], 'use_label_encoder': [False], 'eval_metric': ['error', 'logloss']}

# # define the f1 score as the evaluation metric
# scorer = make_scorer(f1_score, average='micro')

# # perform grid search using 5-fold cross validation and f1 score as the evaluation metric
# grid = GridSearchCV(xgb_classifier_gscv, param_grid, cv=5, scoring=scorer, n_jobs=-1, verbose=1)

# # fit the grid search to the data
# grid.fit(X_train, y_train)

# # print the best hyperparameters and the corresponding score
# print("Best hyperparameters: ", grid.best_params_)
# print("Best score: ", grid.best_score_)

# Best hyperparameters:  {'eval_metric': 'logloss', 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'use_label_encoder': False}
# Best score:  0.9572503311974945
In [252]:
# Import XGBoost
import xgboost as xgb

# Define the XGBoost model with the best hyperparameters from the grid search
xgb_classifier_gscv = xgb.XGBClassifier(eval_metric='logloss', learning_rate=0.1, max_depth=5, n_estimators=200, use_label_encoder=False)

result = fit_and_evaluate_classification("XGBoost Classifier Grid Search", xgb_classifier_gscv, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
8 Support Vector Classifier 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
9 Support Vector Classifier Grid Search 0.97 0.98 0.95 0.97 0.97 0.94 0.94 0.84 0.89 0.91
10 XGBoost Classifier 0.97 0.99 0.96 0.97 0.97 0.94 0.93 0.85 0.89 0.91
11 XGBoost Classifier Grid Search 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91

6. Ensemble Techniques

6.1 Stacking

In [247]:
# !pip install mlxtend
In [248]:
# from mlxtend.classifier import StackingCVClassifier

# stack_lr = StackingCVClassifier(classifiers=(xgb_classifier_gscv ,rf_gscv, svc_gscv, knn_gscv, lr_gscv),meta_classifier= lr_gscv, use_features_in_secondary=True,cv=5, n_jobs = -1)

# # Evaluate the models
# result = fit_and_evaluate_classification("Stacked Model-Logistic Regression", stack_lr, X_train, X_test, y_train, y_test)
# new_row = pd.DataFrame({'Model Name': [result[0]], 
#                         'Accuracy (Train)': [result[2]], 
#                         'Precision (Train)': [result[3]], 
#                         'Recall (Train)': [result[4]], 
#                         'F1 Score (Train)': [result[5]], 
#                         'ROC AUC Score (Train)': [result[6]], 
#                         'Accuracy (Test)': [result[7]], 
#                         'Precision (Test)': [result[8]], 
#                         'Recall (Test)': [result[9]], 
#                         'F1 Score (Test)': [result[10]], 
#                         'ROC AUC Score (Test)': [result[11]]})

# results_df = pd.concat([results_df, new_row], ignore_index=True)



# display(results_df)
In [253]:
from mlxtend.classifier import StackingCVClassifier

stack_rf = StackingCVClassifier(classifiers=(xgb_classifier_gscv ,rf_gscv, svc_gscv, knn_gscv, lr_gscv),meta_classifier= rf_gscv, use_features_in_secondary=True,cv=5, n_jobs = -1)

# Evaluate the models
result = fit_and_evaluate_classification("Stacked Model-Random Forest", stack_rf, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
8 Support Vector Classifier 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
9 Support Vector Classifier Grid Search 0.97 0.98 0.95 0.97 0.97 0.94 0.94 0.84 0.89 0.91
10 XGBoost Classifier 0.97 0.99 0.96 0.97 0.97 0.94 0.93 0.85 0.89 0.91
11 XGBoost Classifier Grid Search 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
12 Stacked Model-Random Forest 0.97 0.98 0.95 0.97 0.97 0.94 0.94 0.85 0.89 0.91
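As an aside, scikit-learn ships a built-in `StackingClassifier` with similar semantics to mlxtend's `StackingCVClassifier`: base learners produce cross-validated out-of-fold predictions that feed a meta-learner, and `passthrough=True` plays the role of `use_features_in_secondary=True`. A minimal sketch on synthetic stand-in data (the estimators and sizes here are illustrative, not the notebook's tuned models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# passthrough=True mirrors use_features_in_secondary=True in mlxtend:
# the meta-learner sees the base predictions plus the original features
stack = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier(max_depth=4)),
                ('rf', RandomForestClassifier(n_estimators=50, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5, passthrough=True)
stack.fit(X, y)
print(round(stack.score(X, y), 2))
```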
In [250]:
# from mlxtend.classifier import StackingCVClassifier

# stack_knn = StackingCVClassifier(classifiers=(xgb_classifier_gscv ,rf_gscv, svc_gscv, knn_gscv, lr_gscv),meta_classifier= knn_gscv, use_features_in_secondary=True,cv=5, n_jobs = -1)

# # Evaluate the models
# result = fit_and_evaluate_classification("Stacked Model-KNN", stack_knn, X_train, X_test, y_train, y_test)
# new_row = pd.DataFrame({'Model Name': [result[0]], 
#                         'Accuracy (Train)': [result[2]], 
#                         'Precision (Train)': [result[3]], 
#                         'Recall (Train)': [result[4]], 
#                         'F1 Score (Train)': [result[5]], 
#                         'ROC AUC Score (Train)': [result[6]], 
#                         'Accuracy (Test)': [result[7]], 
#                         'Precision (Test)': [result[8]], 
#                         'Recall (Test)': [result[9]], 
#                         'F1 Score (Test)': [result[10]], 
#                         'ROC AUC Score (Test)': [result[11]]})

# results_df = pd.concat([results_df, new_row], ignore_index=True)



# display(results_df)
In [251]:
# from mlxtend.classifier import StackingCVClassifier

# stack_svc = StackingCVClassifier(classifiers=(xgb_classifier_gscv ,rf_gscv, svc_gscv, knn_gscv, lr_gscv),meta_classifier= svc_gscv, use_features_in_secondary=True,cv=5, n_jobs = -1)

# # Evaluate the models
# result = fit_and_evaluate_classification("Stacked Model-Support Vector Classifier", stack_svc, X_train, X_test, y_train, y_test)
# new_row = pd.DataFrame({'Model Name': [result[0]], 
#                         'Accuracy (Train)': [result[2]], 
#                         'Precision (Train)': [result[3]], 
#                         'Recall (Train)': [result[4]], 
#                         'F1 Score (Train)': [result[5]], 
#                         'ROC AUC Score (Train)': [result[6]], 
#                         'Accuracy (Test)': [result[7]], 
#                         'Precision (Test)': [result[8]], 
#                         'Recall (Test)': [result[9]], 
#                         'F1 Score (Test)': [result[10]], 
#                         'ROC AUC Score (Test)': [result[11]]})

# results_df = pd.concat([results_df, new_row], ignore_index=True)



# display(results_df)
In [252]:
# from mlxtend.classifier import StackingCVClassifier

# stack_xgb = StackingCVClassifier(classifiers=(xgb_classifier_gscv ,rf_gscv, svc_gscv, knn_gscv, lr_gscv),meta_classifier= xgb_classifier_gscv, use_features_in_secondary=True,cv=5, n_jobs = -1)

# # Evaluate the models
# result = fit_and_evaluate_classification("Stacked Model-XGBoost", stack_xgb, X_train, X_test, y_train, y_test)
# new_row = pd.DataFrame({'Model Name': [result[0]], 
#                         'Accuracy (Train)': [result[2]], 
#                         'Precision (Train)': [result[3]], 
#                         'Recall (Train)': [result[4]], 
#                         'F1 Score (Train)': [result[5]], 
#                         'ROC AUC Score (Train)': [result[6]], 
#                         'Accuracy (Test)': [result[7]], 
#                         'Precision (Test)': [result[8]], 
#                         'Recall (Test)': [result[9]], 
#                         'F1 Score (Test)': [result[10]], 
#                         'ROC AUC Score (Test)': [result[11]]})

# results_df = pd.concat([results_df, new_row], ignore_index=True)



# display(results_df)

6.2 Voting

In [253]:
from sklearn.ensemble import VotingClassifier

# create the voting classifier
ensemble_hard = VotingClassifier(estimators=[('lr', lr_gscv), ('rf', rf_gscv), ('svc', svc_gscv), ('knn', knn_gscv), ('xgb', xgb_classifier_gscv)], voting='hard')

# Evaluate the models
result = fit_and_evaluate_classification("Hard Voting", ensemble_hard, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
8 Support Vector Classifier 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
9 Support Vector Classifier Grid Search 0.97 0.98 0.95 0.97 0.97 0.94 0.94 0.84 0.89 0.91
10 XGBoost Classifier 0.97 0.99 0.96 0.97 0.97 0.94 0.93 0.85 0.89 0.91
11 XGBoost Classifier Grid Search 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
12 Hard Voting 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
In [257]:
from sklearn.ensemble import VotingClassifier

# Create the soft-voting classifier. SVC is left out here because, without
# probability=True, it does not expose predict_proba, which soft voting requires.
ensemble_soft = VotingClassifier(estimators=[('lr', lr_gscv), ('rf', rf_gscv), ('knn', knn_gscv), ('xgb', xgb_classifier_gscv)], voting='soft')

# Evaluate the models
result = fit_and_evaluate_classification("Soft Voting", ensemble_soft, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)



display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
8 Support Vector Classifier 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
9 Support Vector Classifier Grid Search 0.97 0.98 0.95 0.97 0.97 0.94 0.94 0.84 0.89 0.91
10 XGBoost Classifier 0.97 0.99 0.96 0.97 0.97 0.94 0.93 0.85 0.89 0.91
11 XGBoost Classifier Grid Search 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
12 Hard Voting 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
13 Soft Voting 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
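The two voting modes differ in how base models are combined: hard voting takes the majority of predicted labels, while soft voting averages predicted probabilities before taking the argmax, so one confident dissenter can outvote two lukewarm models. A toy illustration with made-up probabilities:

```python
import numpy as np

# Toy probabilities from three classifiers for one sample (class order [0, 1])
probs = np.array([[0.45, 0.55],
                  [0.45, 0.55],
                  [0.90, 0.10]])

# Hard voting: each model votes its argmax label, majority wins
hard_votes = probs.argmax(axis=1)             # → [1, 1, 0]
hard_pred = np.bincount(hard_votes).argmax()  # two votes for class 1

# Soft voting: average the probabilities first, then argmax
soft_pred = probs.mean(axis=0).argmax()       # mean = [0.60, 0.40] → class 0
print(hard_pred, soft_pred)  # → 1 0
```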
In [536]:
print(test_df_final.shape)
(8912, 62)
In [537]:
Test_pred = pd.DataFrame(rf_gscv.predict(test_df_final), columns=["ReportedFraud"])
In [538]:
Test=pd.read_csv("../Test Data/Test.csv")
Test["ReportedFraud"]=Test_pred
In [539]:
Test.head()
Out[539]:
CustomerID ReportedFraud
0 Cust10008 0
1 Cust10010 0
2 Cust10015 0
3 Cust10020 0
4 Cust1003 0
In [540]:
Test.to_csv("./Test_predictions.csv", index=False)  # index=False keeps the row index out of the file
In [ ]:
# # Import necessary libraries
# from sklearn.metrics import f1_score
# from keras.models import Sequential
# from keras.layers import Dense

# # Define the model architecture
# model = Sequential()
# model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
# model.add(Dense(32, activation='relu'))
# model.add(Dense(1, activation='sigmoid'))

# # Compile the model (Keras has no built-in 'f1' metric, so F1 is
# # computed below with sklearn instead)
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# # Train the model
# model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1)

# # Predict on the test set
# y_train_pred = model.predict(X_train)
# y_train_pred = (y_train_pred > 0.5).astype(int)

# # Predict on the test set
# y_test_pred = model.predict(X_test)
# y_test_pred = (y_test_pred > 0.5).astype(int)

# # Calculate the F1 score
# f1_train = f1_score(y_train, y_train_pred)

# # Calculate the F1 score
# f1_test = f1_score(y_test, y_test_pred)

# # Print the F1 score
# print('F1 score Train: {:.2f}'.format(f1_train))

# # Print the F1 score
# print('F1 score Test: {:.2f}'.format(f1_test))
In [243]:
# F1 score Train: 0.98
# F1 score Test: 0.84

7. Learning Curves
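
Before plotting curves for each tuned model, here is a minimal, self-contained sketch of how scikit-learn's `learning_curve` works, using a synthetic dataset and a plain logistic regression (both illustrative, not the project's data or models):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# toy binary-classification data standing in for the claims dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# F1 on progressively larger training subsets, evaluated with 3-fold CV
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=3, train_sizes=np.linspace(0.1, 1.0, 5), scoring='f1')

print(sizes)                       # absolute training-set sizes used
print(train_scores.mean(axis=1))   # mean train F1 at each size
print(val_scores.mean(axis=1))     # mean cross-validation F1 at each size
```

A widening gap between the train and validation curves signals overfitting; two curves converging at a low score signal underfitting.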

In [258]:
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : str
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines the minimum and maximum y-values plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise, it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to be
        big enough to contain at least one sample from each class.

    Returns
    -------
    plt : matplotlib.pyplot object
    """
    plt.figure(figsize=(10, 8))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=-1, train_sizes=train_sizes, scoring='f1')
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
In [259]:
plot_learning_curve(lr_gscv, 'Learning Curve for Logistic Regression', X_train, y_train, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[259]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [260]:
plot_learning_curve(dt_gscv, 'Learning Curve for Decision Tree', X_train, y_train, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[260]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [261]:
plot_learning_curve(rf_gscv, 'Learning Curve for Random Forest', X_train, y_train, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[261]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [262]:
plot_learning_curve(knn_gscv, 'Learning Curve for K-Nearest Neighbour', X_train, y_train, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[262]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [263]:
plot_learning_curve(svc_gscv, 'Learning Curve for Support Vector Classifier', X_train, y_train, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[263]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [264]:
plot_learning_curve(xgb_classifier_gscv, 'Learning Curve for XGBoost', X_train, y_train, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[264]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [265]:
X_train.head()
Out[265]:
InsuredAge CapitalGains CapitalLoss CustomerLoyaltyPeriod Policy_Deductible PolicyAnnualPremium UmbrellaLimit IncidentTime NumberOfVehicles BodilyInjuries Witnesses AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage SplitLimit CombinedSingleLimit VehicleAge TimeBetweenCoverageAndIncident InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 -0.37 -0.83 0.90 -0.10 -0.21 0.91 1.18 0.87 0.16 0.01 -1.53 0.79 0.38 0.37 0.92 -1.14 -1.03 1.56 0.87 0.35 3 0.43 0.32 1 1 0.16 0.59 0.74 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
1 -1.40 -0.83 0.90 -0.80 -1.14 1.27 -0.51 -1.49 2.25 -1.31 -0.52 0.57 0.90 0.88 0.35 -1.14 -0.29 -1.17 1.56 0.25 4 0.45 0.43 1 3 0.25 0.58 0.60 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0
2 0.92 0.85 0.90 0.75 1.23 -0.82 -0.51 -0.14 -0.89 0.01 0.49 -0.71 -0.79 -0.81 -0.59 -0.16 -0.29 -0.75 -1.40 0.34 4 0.51 0.22 4 1 0.15 0.41 0.61 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0
3 -0.11 -0.83 -1.28 -0.16 1.58 -0.23 -0.51 1.88 1.21 1.33 0.49 0.60 0.93 -0.43 0.71 -0.16 -1.03 1.20 -0.65 0.76 4 0.51 0.48 4 3 0.84 0.46 0.45 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1
4 -1.27 -0.83 -1.49 -0.83 -1.14 0.25 -0.51 0.36 1.21 -1.31 1.50 0.73 -0.09 -0.10 1.06 1.47 1.53 0.03 1.32 0.24 1 0.44 0.30 2 2 0.39 0.57 0.54 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0
In [266]:
X_train.tail()
Out[266]:
InsuredAge CapitalGains CapitalLoss CustomerLoyaltyPeriod Policy_Deductible PolicyAnnualPremium UmbrellaLimit IncidentTime NumberOfVehicles BodilyInjuries Witnesses AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage SplitLimit CombinedSingleLimit VehicleAge TimeBetweenCoverageAndIncident InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
29467 -1.01 -0.83 -0.38 -1.07 -0.65 -1.23 -0.51 0.10 1.21 -1.31 1.50 0.25 -0.35 0.33 0.35 1.47 1.53 0.02 -0.63 0.97 4 0.44 0.30 5 2 1.00 0.56 0.35 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0
29468 0.50 -0.83 -1.13 0.55 1.67 0.46 1.78 0.61 0.92 0.19 -0.24 0.32 0.77 0.62 0.10 -0.16 -0.29 -1.33 0.88 0.95 1 0.53 0.55 1 2 0.77 0.50 0.59 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
29469 0.51 1.50 -1.03 0.57 -0.52 0.88 -0.51 0.15 -0.89 -1.03 1.50 -0.19 -0.08 0.28 -0.31 -0.16 -0.29 0.41 -0.58 0.92 2 0.50 0.51 4 2 0.95 0.41 0.47 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0
29470 0.49 1.94 0.90 -0.53 0.42 0.08 -0.51 -0.22 0.39 0.01 -0.74 -0.04 0.08 0.39 -0.18 0.84 0.82 0.28 -0.13 1.00 0 0.59 0.48 0 2 1.00 0.57 0.58 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0
29471 -0.35 -0.83 0.18 -0.14 0.92 0.71 -0.51 -1.32 -0.89 0.91 1.50 -2.01 -1.57 -1.51 -2.04 0.64 0.72 -0.25 1.46 0.40 5 0.50 0.56 2 1 0.41 0.57 0.58 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0
In [267]:
X_test.head()
Out[267]:
InsuredAge CapitalGains CapitalLoss CustomerLoyaltyPeriod Policy_Deductible PolicyAnnualPremium UmbrellaLimit IncidentTime NumberOfVehicles BodilyInjuries Witnesses AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage SplitLimit CombinedSingleLimit VehicleAge TimeBetweenCoverageAndIncident InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
0 0.27 -0.83 0.90 0.41 -1.16 -0.78 -0.51 -0.65 -0.89 0.01 0.49 -1.99 -1.50 -1.64 -2.00 -0.16 -0.29 -0.57 -0.14 0.36 4 0.53 0.44 1 1 0.31 0.37 0.30 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0
1 -1.66 -0.83 0.90 -1.07 -0.21 1.73 -0.51 -0.81 1.21 0.01 -0.52 0.75 1.50 -0.25 0.73 -0.16 -0.29 1.39 -1.30 0.29 5 0.43 0.44 5 3 0.15 0.37 0.47 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
2 -0.50 0.33 -0.71 -0.76 -1.16 -0.69 -0.51 1.21 2.25 0.01 0.49 0.17 -0.00 -0.02 0.24 -0.16 -0.29 -0.58 1.06 1.00 0 0.65 0.31 2 1 0.99 0.56 0.56 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0
3 0.14 0.99 -1.16 -0.13 -0.35 -1.30 -0.51 -0.14 -0.89 -1.31 0.49 0.79 0.10 1.51 0.69 1.47 1.53 -0.36 -1.38 0.31 3 0.50 0.43 5 3 0.44 0.50 0.54 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0
4 -0.24 1.08 0.90 -0.53 -0.99 -1.34 -0.51 -1.32 1.21 -1.31 1.50 -0.46 -0.61 -0.63 -0.33 -0.16 -0.29 1.18 -0.27 0.35 0 0.56 0.35 0 1 0.35 0.41 0.56 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
In [268]:
X_test.tail()
Out[268]:
InsuredAge CapitalGains CapitalLoss CustomerLoyaltyPeriod Policy_Deductible PolicyAnnualPremium UmbrellaLimit IncidentTime NumberOfVehicles BodilyInjuries Witnesses AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage SplitLimit CombinedSingleLimit VehicleAge TimeBetweenCoverageAndIncident InsuredZipCode InsuredEducationLevel InsuredOccupation InsuredHobbies InsuredRelationship SeverityOfIncident IncidentAddress VehicleMake VehicleModel InsuredGender_MALE InsurancePolicyState_State2 InsurancePolicyState_State3 TypeOfIncident_Parked Car TypeOfIncident_Single Vehicle Collision TypeOfIncident_Vehicle Theft TypeOfCollission_Rear Collision TypeOfCollission_Side Collision AuthoritiesContacted_Fire AuthoritiesContacted_None AuthoritiesContacted_Other AuthoritiesContacted_Police IncidentState_State4 IncidentState_State5 IncidentState_State6 IncidentState_State7 IncidentState_State8 IncidentState_State9 IncidentCity_City2 IncidentCity_City3 IncidentCity_City4 IncidentCity_City5 IncidentCity_City6 IncidentCity_City7 PropertyDamage_YES PoliceReport_YES DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday MonthOfIncident_January MonthOfIncident_March
8646 -1.66 -0.83 -1.13 -1.69 -0.30 -0.13 -0.51 -0.14 1.21 1.33 -1.53 -0.37 -1.66 -0.47 0.02 1.47 -1.03 -0.97 1.56 0.98 2 0.43 0.92 5 2 0.98 0.37 0.47 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
8647 2.33 -0.83 -1.05 2.30 1.67 -0.36 -0.30 1.71 -0.89 -1.31 -1.53 0.11 0.86 0.84 -0.27 1.47 1.53 0.21 -1.18 0.29 5 0.59 0.22 0 3 0.29 0.50 0.59 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0
8648 0.66 -0.83 0.90 0.41 -0.78 0.53 -0.51 -0.81 0.16 0.01 0.49 -0.00 0.66 0.02 -0.17 -1.14 -1.03 -1.55 1.05 0.19 6 0.51 0.32 0 1 0.19 0.46 0.45 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0
8649 -0.11 -0.83 0.90 0.09 -1.16 0.73 -0.51 -0.31 -0.89 -1.31 -0.52 0.69 -0.08 1.52 0.60 -0.16 -0.29 -0.56 -1.27 0.31 1 0.51 0.32 4 3 0.29 0.57 0.65 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
8650 -1.01 1.45 -1.52 -0.52 -0.58 0.40 -0.51 -1.15 -0.89 0.01 -0.52 -2.11 -1.61 -1.68 -2.12 -1.14 -1.03 0.22 -0.96 0.37 4 0.53 0.48 1 1 0.09 0.44 0.35 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0
In [269]:
y_train.shape
Out[269]:
(29472,)
In [270]:
y_test.shape
Out[270]:
(8651,)
In [271]:
X_combined = pd.concat([X_train, X_test], axis=0)

y_combined = np.concatenate([y_train, y_test], axis=0)
In [272]:
X_combined.shape
Out[272]:
(38123, 62)
In [273]:
y_combined.shape
Out[273]:
(38123,)
In [274]:
plot_learning_curve(lr_gscv, 'Learning Curve for Logistic Regression', X_combined, y_combined, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[274]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [275]:
plot_learning_curve(dt_gscv, 'Learning Curve for Decision Tree', X_combined, y_combined, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[275]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [276]:
plot_learning_curve(rf_gscv, 'Learning Curve for Random Forest', X_combined, y_combined, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[276]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [277]:
plot_learning_curve(knn_gscv, 'Learning Curve for K-Nearest Neighbour', X_combined, y_combined, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[277]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [278]:
plot_learning_curve(svc_gscv, 'Learning Curve for Support Vector Classifier', X_combined, y_combined, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[278]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>
In [279]:
plot_learning_curve(xgb_classifier_gscv, 'Learning Curve for XGBoost', X_combined, y_combined, ylim=None, cv=3, train_sizes=np.linspace(.1, 1.0, 5))
Out[279]:
<module 'matplotlib.pyplot' from '/usr/share/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
<Figure size 432x288 with 0 Axes>

8. K-Fold Cross Validation
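
The idea applied below can be sketched on toy data first: a shuffled `KFold` splitter is passed to `cross_val_score`, and the per-fold F1 scores are summarized by their mean and standard deviation. The dataset and estimator here are illustrative placeholders, not the project's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# toy binary-classification data
X_demo, y_demo = make_classification(n_samples=300, random_state=0)

# shuffled 5-fold splitter with a fixed seed for reproducibility
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# F1 score on each held-out fold
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=kf, scoring='f1')
print(scores.mean(), scores.std())
```

A small standard deviation across folds, as in the results below, indicates that the model's performance is stable across different subsets of the data.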

In [280]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import f1_score

def cross_val_f1_score(model, X, y, cv=5):
    # shuffled K-fold splitter with a fixed seed for reproducibility
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)
    # F1 score on each held-out fold
    f1_scores = cross_val_score(model, X, y, cv=kf, scoring='f1')
    mean_f1_score = f1_scores.mean()
    std_f1_score = f1_scores.std()

    # report the mean and standard deviation across folds
    print(mean_f1_score)
    print(std_f1_score)

    return None
In [281]:
# Logistic Regression
cross_val_f1_score(lr_gscv, X_combined, y_combined, cv=5)
0.935915814679395
0.0013666091452107154
In [282]:
# Decision Tree
cross_val_f1_score(dt_gscv, X_combined, y_combined, cv=5)
0.9364699542171581
0.0012008995658927533
In [283]:
# Random Forest
cross_val_f1_score(rf_gscv, X_combined, y_combined, cv=5)
0.9381718636506996
0.0013537329987740212
In [284]:
# K-Nearest Neighbour
cross_val_f1_score(knn_gscv, X_combined, y_combined, cv=5)
0.932165977714412
0.00045622746780603774
In [285]:
# Support Vector Classifier
cross_val_f1_score(svc_gscv, X_combined, y_combined, cv=5)
0.9418905422496655
0.0011497865897524646
In [286]:
# XGBoost Classifier
cross_val_f1_score(xgb_classifier_gscv, X_combined, y_combined, cv=5)
0.9480042192448306
0.0014387331501186998

9. Pattern Extraction

In [254]:
train_demographic = pd.read_csv("../Train Data/Train_Demographics.csv",na_values=['NA'])
train_policy = pd.read_csv("../Train Data/Train_Policy.csv",na_values=['NA', '-1', 'MISSINGVAL'])
train_claim = pd.read_csv("../Train Data/Train_Claim.csv",na_values=['?', '-5', 'MISSINGVALUE', 'MISSEDDATA'])
train_vehicle = pd.read_csv("../Train Data/Train_Vehicle.csv" ,na_values=['???'])
train_target = pd.read_csv("../Train Data/Traindata_with_Target.csv")
In [255]:
# pivot the dataframe to get unique values of VehicleAttribute as columns and VehicleAttributeDetails as the values
train_vehicle = train_vehicle.pivot_table(index='CustomerID', columns='VehicleAttribute', values='VehicleAttributeDetails', aggfunc='first')
train_vehicle.head()
Out[255]:
VehicleAttribute VehicleID VehicleMake VehicleModel VehicleYOM
CustomerID
Cust10000 Vehicle26917 Audi A5 2008
Cust10001 Vehicle15893 Audi A5 2006
Cust10002 Vehicle5152 Volkswagen Jetta 1999
Cust10003 Vehicle37363 Volkswagen Jetta 2003
Cust10004 Vehicle28633 Toyota CRV 2010
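The long-to-wide reshape above can be sketched on toy data (illustrative values, not the project's records): each `(CustomerID, VehicleAttribute)` pair becomes one cell, and `aggfunc='first'` simply keeps the single value recorded for that pair.

```python
import pandas as pd

# long-format vehicle data: one row per (customer, attribute) pair
long_df = pd.DataFrame({
    'CustomerID': ['C1', 'C1', 'C2', 'C2'],
    'VehicleAttribute': ['VehicleMake', 'VehicleModel'] * 2,
    'VehicleAttributeDetails': ['Audi', 'A5', 'Toyota', 'CRV'],
})

# pivot to one row per customer, one column per attribute
wide = long_df.pivot_table(index='CustomerID', columns='VehicleAttribute',
                           values='VehicleAttributeDetails', aggfunc='first')
print(wide)
```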
In [256]:
# remove the 'VehicleAttribute' name from the columns axis
train_vehicle = train_vehicle.rename_axis(None, axis=1)
# reset the index to convert the pivot table to a regular dataframe
train_vehicle = train_vehicle.reset_index()
# print the resulting head of the dataframe and its shape
display(train_vehicle.head(10))
print("\n")
print('Shape of train_vehicle:', train_vehicle.shape)
CustomerID VehicleID VehicleMake VehicleModel VehicleYOM
0 Cust10000 Vehicle26917 Audi A5 2008
1 Cust10001 Vehicle15893 Audi A5 2006
2 Cust10002 Vehicle5152 Volkswagen Jetta 1999
3 Cust10003 Vehicle37363 Volkswagen Jetta 2003
4 Cust10004 Vehicle28633 Toyota CRV 2010
5 Cust10005 Vehicle26409 Toyota CRV 2011
6 Cust10006 Vehicle12114 Mercedes C300 2000
7 Cust10007 Vehicle26987 Suburu C300 2010
8 Cust10009 Vehicle12490 Volkswagen Passat 1995
9 Cust1001 Vehicle28516 Saab 92x 2004

Shape of train_vehicle: (28836, 5)
In [257]:
# merge the dataframes based on the CustomerID column
merged_df = pd.merge(train_demographic, train_policy, on='CustomerID')
merged_df = pd.merge(merged_df, train_claim, on='CustomerID')
merged_df = pd.merge(merged_df, train_vehicle, on='CustomerID')
train_df = pd.merge(merged_df, train_target, on='CustomerID')
display(train_df.head(10))
print("\n")
print('Shape of train_df:', train_df.shape)
CustomerID InsuredAge InsuredZipCode InsuredGender InsuredEducationLevel InsuredOccupation InsuredHobbies CapitalGains CapitalLoss Country InsurancePolicyNumber CustomerLoyaltyPeriod DateOfPolicyCoverage InsurancePolicyState Policy_CombinedSingleLimit Policy_Deductible PolicyAnnualPremium UmbrellaLimit InsuredRelationship DateOfIncident TypeOfIncident TypeOfCollission SeverityOfIncident AuthoritiesContacted IncidentState IncidentCity IncidentAddress IncidentTime NumberOfVehicles PropertyDamage BodilyInjuries Witnesses PoliceReport AmountOfTotalClaim AmountOfInjuryClaim AmountOfPropertyClaim AmountOfVehicleDamage VehicleID VehicleMake VehicleModel VehicleYOM ReportedFraud
0 Cust10000 35 454776 MALE JD armed-forces movies 56700 -48500 India 119121 49 25-10-1998 State1 100/300 1000 1632.73 0 not-in-family 03-02-2015 Multi-vehicle Collision Side Collision Total Loss Police State7 City1 Location 1311 17.00 3 NaN 1 0.00 NaN 65501.00 13417 6071 46013 Vehicle26917 Audi A5 2008 N
1 Cust10001 36 454776 MALE JD tech-support cross-fit 70600 -48500 India 119122 114 15-11-2000 State1 100/300 1000 1255.19 0 not-in-family 02-02-2015 Multi-vehicle Collision Side Collision Total Loss Police State7 City5 Location 1311 10.00 3 YES 2 1.00 YES 61382.00 15560 5919 39903 Vehicle15893 Audi A5 2006 N
2 Cust10002 33 603260 MALE JD armed-forces polo 66400 -63700 India 119123 167 12-02-2001 State3 500/1000 617 1373.38 0 wife 15-01-2015 Single Vehicle Collision Side Collision Minor Damage Other State8 City6 Location 2081 22.00 1 YES 2 3.00 NO 66755.00 11630 11630 43495 Vehicle5152 Volkswagen Jetta 1999 N
3 Cust10003 36 474848 MALE JD armed-forces polo 47900 -73400 India 119124 190 11-04-2005 State2 500/1000 722 1337.60 0 own-child 19-01-2015 Single Vehicle Collision Side Collision Minor Damage Other State9 City6 Location 2081 22.00 1 YES 2 3.00 NO 66243.00 12003 12003 42237 Vehicle37363 Volkswagen Jetta 2003 N
4 Cust10004 29 457942 FEMALE High School exec-managerial dancing 0 -41500 India 119125 115 25-10-1996 State2 100/300 500 1353.73 4279863 unmarried 09-01-2015 Single Vehicle Collision Rear Collision Minor Damage Fire State8 City6 Location 1695 10.00 1 NO 2 1.00 YES 53544.00 8829 7234 37481 Vehicle28633 Toyota CRV 2010 N
5 Cust10005 28 457942 FEMALE High School exec-managerial dancing 0 -41500 India 119126 101 24-10-1999 State2 100/300 500 1334.49 3921366 unmarried 07-02-2015 Single Vehicle Collision Rear Collision Minor Damage Fire State7 City6 Location 1695 7.00 1 NO 1 2.00 NaN 53167.00 7818 8132 37217 Vehicle26409 Toyota CRV 2011 N
6 Cust10006 57 476456 MALE Masters adm-clerical sleeping 67400 0 India 119127 471 18-02-1995 State3 100/300 512 1214.78 165819 own-child 30-01-2015 Single Vehicle Collision Front Collision Minor Damage Ambulance State5 City4 Location 1440 20.00 1 NaN 0 2.00 NO 77453.00 6476 12822 58155 Vehicle12114 Mercedes C300 2000 N
7 Cust10007 49 476456 MALE Masters adm-clerical sleeping 67400 0 India 119128 340 22-02-1993 State3 100/300 877 1159.81 5282219 own-child 12-01-2015 Single Vehicle Collision Front Collision Minor Damage Police State5 City3 Location 1440 18.00 1 NaN 0 2.00 NO 60569.00 5738 7333 47498 Vehicle26987 Suburu C300 2010 N
8 Cust10009 27 432896 FEMALE High School handlers-cleaners camping 56400 -32800 India 119130 81 09-05-1998 State2 500/1000 2000 989.53 0 own-child 06-02-2015 Multi-vehicle Collision Front Collision Minor Damage Ambulance State9 City2 Location 1521 3.00 3 YES 0 0.00 NaN 67876.00 6788 7504 53584 Vehicle12490 Volkswagen Passat 1995 N
9 Cust1001 48 466132 MALE MD craft-repair sleeping 53300 0 India 110122 328 17-10-2014 State3 250/500 1000 1406.91 0 husband 25-01-2015 Single Vehicle Collision Side Collision Major Damage Police State7 City2 Location 1596 5.00 1 YES 1 2.00 YES 71610.00 6510 13020 52080 Vehicle28516 Saab 92x 2004 Y

Shape of train_df: (28836, 42)
In [258]:
# create two new columns by splitting the values column
train_df[['SplitLimit', 'CombinedSingleLimit']] = train_df['Policy_CombinedSingleLimit'].str.split('/', expand=True)

# convert the columns to the appropriate data types if necessary
train_df['SplitLimit'] = train_df['SplitLimit'].astype(int)
train_df['CombinedSingleLimit'] = train_df['CombinedSingleLimit'].astype(int)

# dropping the original column
train_df.drop("Policy_CombinedSingleLimit", axis=1, inplace=True)
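As a quick illustration of the split above, `str.split('/', expand=True)` turns a `'A/B'` string column into two new columns in one step (toy values, not the project's data):

```python
import pandas as pd

# sample policy limits in the same 'split/combined' format as the dataset
toy = pd.DataFrame({'Policy_CombinedSingleLimit': ['100/300', '250/500', '500/1000']})

# expand=True returns a DataFrame, so both halves can be assigned at once
toy[['SplitLimit', 'CombinedSingleLimit']] = (
    toy['Policy_CombinedSingleLimit'].str.split('/', expand=True).astype(int))
print(toy[['SplitLimit', 'CombinedSingleLimit']])
```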
In [259]:
# categorical conversion

# storing the categorical columns in the cat_cols variable
cat_cols = ['CustomerID', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel',
           'InsuredOccupation', 'InsuredHobbies', 'Country', 'InsurancePolicyNumber',
           'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission',
           'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity',
           'IncidentAddress', 'PropertyDamage', 'PoliceReport', 'VehicleID',
            'VehicleMake', 'VehicleModel', 'ReportedFraud', 'VehicleYOM']

# calling the convert_columns_types_to_category() function defined above
train_df = convert_columns_types_to_category(train_df, cols=cat_cols, col_type = 'category')
'### Before conversion: ###'
CustomerID                object
InsuredAge                 int64
InsuredZipCode             int64
InsuredGender             object
InsuredEducationLevel     object
InsuredOccupation         object
InsuredHobbies            object
CapitalGains               int64
CapitalLoss                int64
Country                   object
InsurancePolicyNumber      int64
CustomerLoyaltyPeriod      int64
DateOfPolicyCoverage      object
InsurancePolicyState      object
Policy_Deductible          int64
PolicyAnnualPremium      float64
UmbrellaLimit              int64
InsuredRelationship       object
DateOfIncident            object
TypeOfIncident            object
TypeOfCollission          object
SeverityOfIncident        object
AuthoritiesContacted      object
IncidentState             object
IncidentCity              object
IncidentAddress           object
IncidentTime             float64
NumberOfVehicles           int64
PropertyDamage            object
BodilyInjuries             int64
Witnesses                float64
PoliceReport              object
AmountOfTotalClaim       float64
AmountOfInjuryClaim        int64
AmountOfPropertyClaim      int64
AmountOfVehicleDamage      int64
VehicleID                 object
VehicleMake               object
VehicleModel              object
VehicleYOM                object
ReportedFraud             object
SplitLimit                 int64
CombinedSingleLimit        int64
dtype: object
'### After conversion: ###'
CustomerID               category
InsuredAge                  int64
InsuredZipCode           category
InsuredGender            category
InsuredEducationLevel    category
InsuredOccupation        category
InsuredHobbies           category
CapitalGains                int64
CapitalLoss                 int64
Country                  category
InsurancePolicyNumber    category
CustomerLoyaltyPeriod       int64
DateOfPolicyCoverage       object
InsurancePolicyState     category
Policy_Deductible           int64
PolicyAnnualPremium       float64
UmbrellaLimit               int64
InsuredRelationship      category
DateOfIncident             object
TypeOfIncident           category
TypeOfCollission         category
SeverityOfIncident       category
AuthoritiesContacted     category
IncidentState            category
IncidentCity             category
IncidentAddress          category
IncidentTime              float64
NumberOfVehicles            int64
PropertyDamage           category
BodilyInjuries              int64
Witnesses                 float64
PoliceReport             category
AmountOfTotalClaim        float64
AmountOfInjuryClaim         int64
AmountOfPropertyClaim       int64
AmountOfVehicleDamage       int64
VehicleID                category
VehicleMake              category
VehicleModel             category
VehicleYOM               category
ReportedFraud            category
SplitLimit                  int64
CombinedSingleLimit         int64
dtype: object
In [260]:
# datetime conversion

# using the pandas to_datetime() function, convert the date columns into datetime format
train_df['DateOfPolicyCoverage'] = pd.to_datetime(train_df['DateOfPolicyCoverage'], format='%d-%m-%Y')
train_df['DateOfIncident'] = pd.to_datetime(train_df['DateOfIncident'], format='%d-%m-%Y')
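The explicit `format='%d-%m-%Y'` matters here: it tells pandas the strings are day-first, so a date like `'03-02-2015'` is parsed as 3 February rather than 2 March. A toy example (dates are illustrative):

```python
import pandas as pd

# day-first date strings, as in DateOfPolicyCoverage / DateOfIncident
s = pd.Series(['25-10-1998', '03-02-2015'])
dt = pd.to_datetime(s, format='%d-%m-%Y')
print(dt.dt.year.tolist(), dt.dt.month.tolist(), dt.dt.day.tolist())
```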
In [261]:
# storing the columns which are to be dropped in drop_cols
drop_cols = ['CustomerID', 'Country', 'InsurancePolicyNumber', 'VehicleID', 'DateOfPolicyCoverage', 'DateOfIncident']

# calling the function defined above to drop the columns
train_df = drop_unnecessary_columns(train_df, cols=drop_cols)
'Before Dropping  : '
Index(['CustomerID', 'InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'Country', 'InsurancePolicyNumber', 'CustomerLoyaltyPeriod', 'DateOfPolicyCoverage', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'DateOfIncident', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleID', 'VehicleMake', 'VehicleModel', 'VehicleYOM', 'ReportedFraud', 'SplitLimit', 'CombinedSingleLimit'], dtype='object')
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦-♦
'After Dropping : '
Index(['InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleMake', 'VehicleModel', 'VehicleYOM', 'ReportedFraud', 'SplitLimit', 'CombinedSingleLimit'], dtype='object')
In [262]:
X, y = get_X_y_dataframes(train_df, 'ReportedFraud')
Columns in X : Index(['InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleMake', 'VehicleModel', 'VehicleYOM', 'SplitLimit', 'CombinedSingleLimit'], dtype='object')
shape of X : (28836, 36)
****************************************
shape of y : (28836,)
In [263]:
X_train, X_test, y_train, y_test = perform_only_train_test_split(X,y,test_size=0.3,random_state=1234, stratify = y)
X_train shape:  (20185, 36)
X_test shape:  (8651, 36)
y_train shape:  (20185,)
y_test shape:  (8651,)
In [264]:
InsuredZipCode_map = create_map_dict(X_train, 'InsuredZipCode')
InsuredGender_map = create_map_dict(X_train, 'InsuredGender')
InsuredEducationLevel_map = create_map_dict(X_train, 'InsuredEducationLevel')
InsuredOccupation_map = create_map_dict(X_train, 'InsuredOccupation')
InsuredHobbies_map = create_map_dict(X_train, 'InsuredHobbies')
InsurancePolicyState_map = create_map_dict(X_train, 'InsurancePolicyState')
InsuredRelationship_map = create_map_dict(X_train, 'InsuredRelationship')
TypeOfIncident_map = create_map_dict(X_train, 'TypeOfIncident')
TypeOfCollission_map = create_map_dict(X_train, 'TypeOfCollission')
SeverityOfIncident_map = create_map_dict(X_train, 'SeverityOfIncident')
AuthoritiesContacted_map = create_map_dict(X_train, 'AuthoritiesContacted')
IncidentState_map = create_map_dict(X_train, 'IncidentState')
IncidentCity_map = create_map_dict(X_train, 'IncidentCity')
IncidentAddress_map = create_map_dict(X_train, 'IncidentAddress')
PropertyDamage_map = create_map_dict(X_train, 'PropertyDamage')
PoliceReport_map = create_map_dict(X_train, 'PoliceReport')
VehicleMake_map = create_map_dict(X_train, 'VehicleMake')
VehicleModel_map = create_map_dict(X_train, 'VehicleModel')
VehicleYOM_map = create_map_dict(X_train, 'VehicleYOM')
In [265]:
X_train.loc[:,'InsuredZipCode'] = X_train['InsuredZipCode'].map(InsuredZipCode_map)
X_train.loc[:,'InsuredGender'] = X_train['InsuredGender'].map(InsuredGender_map)
X_train.loc[:,'InsuredEducationLevel'] = X_train['InsuredEducationLevel'].map(InsuredEducationLevel_map)
X_train.loc[:,'InsuredOccupation'] = X_train['InsuredOccupation'].map(InsuredOccupation_map)
X_train.loc[:,'InsuredHobbies'] = X_train['InsuredHobbies'].map(InsuredHobbies_map)
X_train.loc[:,'InsurancePolicyState'] = X_train['InsurancePolicyState'].map(InsurancePolicyState_map)
X_train.loc[:,'InsuredRelationship'] = X_train['InsuredRelationship'].map(InsuredRelationship_map)
X_train.loc[:,'TypeOfIncident'] = X_train['TypeOfIncident'].map(TypeOfIncident_map)
X_train.loc[:,'TypeOfCollission'] = X_train['TypeOfCollission'].map(TypeOfCollission_map)
X_train.loc[:,'SeverityOfIncident'] = X_train['SeverityOfIncident'].map(SeverityOfIncident_map)
X_train.loc[:,'AuthoritiesContacted'] = X_train['AuthoritiesContacted'].map(AuthoritiesContacted_map)
X_train.loc[:,'IncidentState'] = X_train['IncidentState'].map(IncidentState_map)
X_train.loc[:,'IncidentCity'] = X_train['IncidentCity'].map(IncidentCity_map)
X_train.loc[:,'IncidentAddress'] = X_train['IncidentAddress'].map(IncidentAddress_map)
X_train.loc[:,'PropertyDamage'] = X_train['PropertyDamage'].map(PropertyDamage_map)
X_train.loc[:,'PoliceReport'] = X_train['PoliceReport'].map(PoliceReport_map)
X_train.loc[:,'VehicleMake'] = X_train['VehicleMake'].map(VehicleMake_map)
X_train.loc[:,'VehicleModel'] = X_train['VehicleModel'].map(VehicleModel_map)
X_train.loc[:,'VehicleYOM'] = X_train['VehicleYOM'].map(VehicleYOM_map)
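`Series.map` replaces each value through the dictionary; any category with no entry in the (train-built) map becomes `NaN`. This is why the maps are created from `X_train` only, and why unseen levels surface as missing values to be imputed later. A minimal illustration with a hypothetical `color_map`:

```python
import pandas as pd

# Mapping built from training data only; 'green' never appeared in train.
color_map = {'red': 0, 'blue': 1}
s = pd.Series(['red', 'blue', 'green'])

mapped = s.map(color_map)
print(mapped.tolist())  # [0.0, 1.0, nan] - 'green' has no entry, so it becomes NaN
```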
In [266]:
X_train.isnull().sum()
Out[266]:
InsuredAge                  0
InsuredZipCode              0
InsuredGender              23
InsuredEducationLevel       0
InsuredOccupation           0
InsuredHobbies              0
CapitalGains                0
CapitalLoss                 0
CustomerLoyaltyPeriod       0
InsurancePolicyState        0
Policy_Deductible           0
PolicyAnnualPremium        99
UmbrellaLimit               0
InsuredRelationship         0
TypeOfIncident              0
TypeOfCollission         3664
SeverityOfIncident          0
AuthoritiesContacted        0
IncidentState               0
IncidentCity                0
IncidentAddress             0
IncidentTime               22
NumberOfVehicles            0
PropertyDamage           7296
BodilyInjuries              0
Witnesses                  34
PoliceReport             6853
AmountOfTotalClaim         36
AmountOfInjuryClaim         0
AmountOfPropertyClaim       0
AmountOfVehicleDamage       0
VehicleMake                36
VehicleModel                0
VehicleYOM                  0
SplitLimit                  0
CombinedSingleLimit         0
dtype: int64
In [267]:
X_test.isnull().sum()
Out[267]:
InsuredAge                  0
InsuredZipCode              0
InsuredGender               7
InsuredEducationLevel       0
InsuredOccupation           0
InsuredHobbies              0
CapitalGains                0
CapitalLoss                 0
CustomerLoyaltyPeriod       0
InsurancePolicyState        0
Policy_Deductible           0
PolicyAnnualPremium        42
UmbrellaLimit               0
InsuredRelationship         0
TypeOfIncident              0
TypeOfCollission         1498
SeverityOfIncident          0
AuthoritiesContacted        0
IncidentState               0
IncidentCity                0
IncidentAddress             0
IncidentTime                9
NumberOfVehicles            0
PropertyDamage           3163
BodilyInjuries              0
Witnesses                  12
PoliceReport             2952
AmountOfTotalClaim         14
AmountOfInjuryClaim         0
AmountOfPropertyClaim       0
AmountOfVehicleDamage       0
VehicleMake                14
VehicleModel                0
VehicleYOM                  0
SplitLimit                  0
CombinedSingleLimit         0
dtype: int64
In [268]:
cols = X_train.columns
imputer_dt = KNNImputer(n_neighbors=1)
imputed_array = imputer_dt.fit_transform(X_train[cols])

X_train_imp = pd.DataFrame(imputed_array, columns = cols)
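With `n_neighbors=1`, `KNNImputer` copies each missing entry from the single most similar row (nan-aware Euclidean distance over the observed columns), so every imputed value is a value actually present in the training data — which matters later when the integer codes are mapped back to category labels. A toy illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'a': [1.0, 1.1, 9.0],
                   'b': [10.0, np.nan, 50.0]})

# Row 1 is closest to row 0 on column 'a', so its missing 'b' is copied from row 0.
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled['b'].tolist())  # [10.0, 10.0, 50.0]
```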
In [269]:
X_train_imp.isnull().sum()
Out[269]:
InsuredAge               0
InsuredZipCode           0
InsuredGender            0
InsuredEducationLevel    0
InsuredOccupation        0
InsuredHobbies           0
CapitalGains             0
CapitalLoss              0
CustomerLoyaltyPeriod    0
InsurancePolicyState     0
Policy_Deductible        0
PolicyAnnualPremium      0
UmbrellaLimit            0
InsuredRelationship      0
TypeOfIncident           0
TypeOfCollission         0
SeverityOfIncident       0
AuthoritiesContacted     0
IncidentState            0
IncidentCity             0
IncidentAddress          0
IncidentTime             0
NumberOfVehicles         0
PropertyDamage           0
BodilyInjuries           0
Witnesses                0
PoliceReport             0
AmountOfTotalClaim       0
AmountOfInjuryClaim      0
AmountOfPropertyClaim    0
AmountOfVehicleDamage    0
VehicleMake              0
VehicleModel             0
VehicleYOM               0
SplitLimit               0
CombinedSingleLimit      0
dtype: int64
In [270]:
# get column names
column_names = X_train_imp.columns
display("Column names: ", column_names)

# get column indices
for column_name in column_names:
    column_index = X_train_imp.columns.get_loc(column_name)
    display(f"Column '{column_name}' index: {column_index}")
'Column names: '
Index(['InsuredAge', 'InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation', 'InsuredHobbies', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'InsurancePolicyState', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'InsuredRelationship', 'TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'IncidentCity', 'IncidentAddress', 'IncidentTime', 'NumberOfVehicles', 'PropertyDamage', 'BodilyInjuries', 'Witnesses', 'PoliceReport', 'AmountOfTotalClaim', 'AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleMake', 'VehicleModel', 'VehicleYOM', 'SplitLimit', 'CombinedSingleLimit'], dtype='object')
"Column 'InsuredAge' index: 0"
"Column 'InsuredZipCode' index: 1"
"Column 'InsuredGender' index: 2"
"Column 'InsuredEducationLevel' index: 3"
"Column 'InsuredOccupation' index: 4"
"Column 'InsuredHobbies' index: 5"
"Column 'CapitalGains' index: 6"
"Column 'CapitalLoss' index: 7"
"Column 'CustomerLoyaltyPeriod' index: 8"
"Column 'InsurancePolicyState' index: 9"
"Column 'Policy_Deductible' index: 10"
"Column 'PolicyAnnualPremium' index: 11"
"Column 'UmbrellaLimit' index: 12"
"Column 'InsuredRelationship' index: 13"
"Column 'TypeOfIncident' index: 14"
"Column 'TypeOfCollission' index: 15"
"Column 'SeverityOfIncident' index: 16"
"Column 'AuthoritiesContacted' index: 17"
"Column 'IncidentState' index: 18"
"Column 'IncidentCity' index: 19"
"Column 'IncidentAddress' index: 20"
"Column 'IncidentTime' index: 21"
"Column 'NumberOfVehicles' index: 22"
"Column 'PropertyDamage' index: 23"
"Column 'BodilyInjuries' index: 24"
"Column 'Witnesses' index: 25"
"Column 'PoliceReport' index: 26"
"Column 'AmountOfTotalClaim' index: 27"
"Column 'AmountOfInjuryClaim' index: 28"
"Column 'AmountOfPropertyClaim' index: 29"
"Column 'AmountOfVehicleDamage' index: 30"
"Column 'VehicleMake' index: 31"
"Column 'VehicleModel' index: 32"
"Column 'VehicleYOM' index: 33"
"Column 'SplitLimit' index: 34"
"Column 'CombinedSingleLimit' index: 35"
In [271]:
from imblearn.over_sampling import SMOTENC

sm_dt = SMOTENC(random_state=42, categorical_features=[1,2,3,4,5,9,13,14,15,16,17,18,19,20,23,26,31,32,33])

X_train_imp, y_train = sm_dt.fit_resample(X_train_imp, y_train)
In [272]:
reverse_InsuredZipCode_map = {v: k for k, v in InsuredZipCode_map.items()}
reverse_InsuredGender_map = {v: k for k, v in InsuredGender_map.items()}
reverse_InsuredEducationLevel_map = {v: k for k, v in InsuredEducationLevel_map.items()}
reverse_InsuredOccupation_map = {v: k for k, v in InsuredOccupation_map.items()}
reverse_InsuredHobbies_map = {v: k for k, v in InsuredHobbies_map.items()}
reverse_InsurancePolicyState_map = {v: k for k, v in InsurancePolicyState_map.items()}
reverse_InsuredRelationship_map = {v: k for k, v in InsuredRelationship_map.items()}
reverse_TypeOfIncident_map = {v: k for k, v in TypeOfIncident_map.items()}
reverse_TypeOfCollission_map = {v: k for k, v in TypeOfCollission_map.items()}
reverse_SeverityOfIncident_map = {v: k for k, v in SeverityOfIncident_map.items()}
reverse_AuthoritiesContacted_map = {v: k for k, v in AuthoritiesContacted_map.items()}
reverse_IncidentState_map = {v: k for k, v in IncidentState_map.items()}
reverse_IncidentCity_map = {v: k for k, v in IncidentCity_map.items()}
reverse_IncidentAddress_map = {v: k for k, v in IncidentAddress_map.items()}
reverse_PropertyDamage_map = {v: k for k, v in PropertyDamage_map.items()}
reverse_PoliceReport_map = {v: k for k, v in PoliceReport_map.items()}
reverse_VehicleMake_map = {v: k for k, v in VehicleMake_map.items()}
reverse_VehicleModel_map = {v: k for k, v in VehicleModel_map.items()}
reverse_VehicleYOM_map = {v: k for k, v in VehicleYOM_map.items()}
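Inverting each dictionary recovers the original labels from the numeric codes. The round trip is safe here because both `n_neighbors=1` imputation and SMOTENC's categorical handling only ever produce codes that already exist in the training data; the inversion does assume the forward map is one-to-one. A minimal example:

```python
color_map = {'red': 0, 'blue': 1, 'green': 2}
reverse_color_map = {v: k for k, v in color_map.items()}

codes = [2, 0, 1]
labels = [reverse_color_map[c] for c in codes]
print(labels)  # ['green', 'red', 'blue']
```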
In [273]:
X_train_imp.loc[:,'InsuredZipCode'] = X_train_imp['InsuredZipCode'].map(reverse_InsuredZipCode_map)
X_train_imp.loc[:,'InsuredGender'] = X_train_imp['InsuredGender'].map(reverse_InsuredGender_map)
X_train_imp.loc[:,'InsuredEducationLevel'] = X_train_imp['InsuredEducationLevel'].map(reverse_InsuredEducationLevel_map)
X_train_imp.loc[:,'InsuredOccupation'] = X_train_imp['InsuredOccupation'].map(reverse_InsuredOccupation_map)
X_train_imp.loc[:,'InsuredHobbies'] = X_train_imp['InsuredHobbies'].map(reverse_InsuredHobbies_map)
X_train_imp.loc[:,'InsurancePolicyState'] = X_train_imp['InsurancePolicyState'].map(reverse_InsurancePolicyState_map)
X_train_imp.loc[:,'InsuredRelationship'] = X_train_imp['InsuredRelationship'].map(reverse_InsuredRelationship_map)
X_train_imp.loc[:,'TypeOfIncident'] = X_train_imp['TypeOfIncident'].map(reverse_TypeOfIncident_map)
X_train_imp.loc[:,'TypeOfCollission'] = X_train_imp['TypeOfCollission'].map(reverse_TypeOfCollission_map)
X_train_imp.loc[:,'SeverityOfIncident'] = X_train_imp['SeverityOfIncident'].map(reverse_SeverityOfIncident_map)
X_train_imp.loc[:,'AuthoritiesContacted'] = X_train_imp['AuthoritiesContacted'].map(reverse_AuthoritiesContacted_map)
X_train_imp.loc[:,'IncidentState'] = X_train_imp['IncidentState'].map(reverse_IncidentState_map)
X_train_imp.loc[:,'IncidentCity'] = X_train_imp['IncidentCity'].map(reverse_IncidentCity_map)
X_train_imp.loc[:,'IncidentAddress'] = X_train_imp['IncidentAddress'].map(reverse_IncidentAddress_map)
X_train_imp.loc[:,'PropertyDamage'] = X_train_imp['PropertyDamage'].map(reverse_PropertyDamage_map)
X_train_imp.loc[:,'PoliceReport'] = X_train_imp['PoliceReport'].map(reverse_PoliceReport_map)
X_train_imp.loc[:,'VehicleMake'] = X_train_imp['VehicleMake'].map(reverse_VehicleMake_map)
X_train_imp.loc[:,'VehicleModel'] = X_train_imp['VehicleModel'].map(reverse_VehicleModel_map)
X_train_imp.loc[:,'VehicleYOM'] = X_train_imp['VehicleYOM'].map(reverse_VehicleYOM_map)
In [274]:
X_train_imp.isnull().sum()
Out[274]:
InsuredAge               0
InsuredZipCode           0
InsuredGender            0
InsuredEducationLevel    0
InsuredOccupation        0
InsuredHobbies           0
CapitalGains             0
CapitalLoss              0
CustomerLoyaltyPeriod    0
InsurancePolicyState     0
Policy_Deductible        0
PolicyAnnualPremium      0
UmbrellaLimit            0
InsuredRelationship      0
TypeOfIncident           0
TypeOfCollission         0
SeverityOfIncident       0
AuthoritiesContacted     0
IncidentState            0
IncidentCity             0
IncidentAddress          0
IncidentTime             0
NumberOfVehicles         0
PropertyDamage           0
BodilyInjuries           0
Witnesses                0
PoliceReport             0
AmountOfTotalClaim       0
AmountOfInjuryClaim      0
AmountOfPropertyClaim    0
AmountOfVehicleDamage    0
VehicleMake              0
VehicleModel             0
VehicleYOM               0
SplitLimit               0
CombinedSingleLimit      0
dtype: int64
In [275]:
# categorical conversion

# collecting the categorical columns in the cat_cols variable
cat_cols = ['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation',
            'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident',
            'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState',
            'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 
             'VehicleMake', 'VehicleModel', 'VehicleYOM']

# calling the convert_columns_types_to_category() function defined above
X_train_imp = convert_columns_types_to_category(X_train_imp, cols=cat_cols, col_type = 'category')
'### Before conversion: ###'
InsuredAge               float64
InsuredZipCode             int64
InsuredGender             object
InsuredEducationLevel     object
InsuredOccupation         object
InsuredHobbies            object
CapitalGains             float64
CapitalLoss              float64
CustomerLoyaltyPeriod    float64
InsurancePolicyState      object
Policy_Deductible        float64
PolicyAnnualPremium      float64
UmbrellaLimit            float64
InsuredRelationship       object
TypeOfIncident            object
TypeOfCollission          object
SeverityOfIncident        object
AuthoritiesContacted      object
IncidentState             object
IncidentCity              object
IncidentAddress           object
IncidentTime             float64
NumberOfVehicles         float64
PropertyDamage            object
BodilyInjuries           float64
Witnesses                float64
PoliceReport              object
AmountOfTotalClaim       float64
AmountOfInjuryClaim      float64
AmountOfPropertyClaim    float64
AmountOfVehicleDamage    float64
VehicleMake               object
VehicleModel              object
VehicleYOM                object
SplitLimit               float64
CombinedSingleLimit      float64
dtype: object
'### After conversion: ###'
InsuredAge                float64
InsuredZipCode           category
InsuredGender            category
InsuredEducationLevel    category
InsuredOccupation        category
InsuredHobbies           category
CapitalGains              float64
CapitalLoss               float64
CustomerLoyaltyPeriod     float64
InsurancePolicyState     category
Policy_Deductible         float64
PolicyAnnualPremium       float64
UmbrellaLimit             float64
InsuredRelationship      category
TypeOfIncident           category
TypeOfCollission         category
SeverityOfIncident       category
AuthoritiesContacted     category
IncidentState            category
IncidentCity             category
IncidentAddress          category
IncidentTime              float64
NumberOfVehicles          float64
PropertyDamage           category
BodilyInjuries            float64
Witnesses                 float64
PoliceReport             category
AmountOfTotalClaim        float64
AmountOfInjuryClaim       float64
AmountOfPropertyClaim     float64
AmountOfVehicleDamage     float64
VehicleMake              category
VehicleModel             category
VehicleYOM               category
SplitLimit                float64
CombinedSingleLimit       float64
dtype: object
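`convert_columns_types_to_category` is a helper defined earlier in the notebook; judging from the before/after dtype listings, its core operation is pandas' `astype('category')`, roughly:

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'a'], 'y': [1.0, 2.0, 3.0]})

# Cast the object column to the memory-efficient category dtype.
df['x'] = df['x'].astype('category')
print(df.dtypes)  # x: category, y: float64
```

The category dtype stores each distinct value once plus small integer codes, which also lets downstream tooling treat these columns as discrete rather than continuous.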
In [276]:
X_test.loc[:,'InsuredZipCode'] = X_test['InsuredZipCode'].map(InsuredZipCode_map)
X_test.loc[:,'InsuredGender'] = X_test['InsuredGender'].map(InsuredGender_map)
X_test.loc[:,'InsuredEducationLevel'] = X_test['InsuredEducationLevel'].map(InsuredEducationLevel_map)
X_test.loc[:,'InsuredOccupation'] = X_test['InsuredOccupation'].map(InsuredOccupation_map)
X_test.loc[:,'InsuredHobbies'] = X_test['InsuredHobbies'].map(InsuredHobbies_map)
X_test.loc[:,'InsurancePolicyState'] = X_test['InsurancePolicyState'].map(InsurancePolicyState_map)
X_test.loc[:,'InsuredRelationship'] = X_test['InsuredRelationship'].map(InsuredRelationship_map)
X_test.loc[:,'TypeOfIncident'] = X_test['TypeOfIncident'].map(TypeOfIncident_map)
X_test.loc[:,'TypeOfCollission'] = X_test['TypeOfCollission'].map(TypeOfCollission_map)
X_test.loc[:,'SeverityOfIncident'] = X_test['SeverityOfIncident'].map(SeverityOfIncident_map)
X_test.loc[:,'AuthoritiesContacted'] = X_test['AuthoritiesContacted'].map(AuthoritiesContacted_map)
X_test.loc[:,'IncidentState'] = X_test['IncidentState'].map(IncidentState_map)
X_test.loc[:,'IncidentCity'] = X_test['IncidentCity'].map(IncidentCity_map)
X_test.loc[:,'IncidentAddress'] = X_test['IncidentAddress'].map(IncidentAddress_map)
X_test.loc[:,'PropertyDamage'] = X_test['PropertyDamage'].map(PropertyDamage_map)
X_test.loc[:,'PoliceReport'] = X_test['PoliceReport'].map(PoliceReport_map)
X_test.loc[:,'VehicleMake'] = X_test['VehicleMake'].map(VehicleMake_map)
X_test.loc[:,'VehicleModel'] = X_test['VehicleModel'].map(VehicleModel_map)
X_test.loc[:,'VehicleYOM'] = X_test['VehicleYOM'].map(VehicleYOM_map)
In [277]:
cols = X_test.columns
imputed_array = imputer_dt.transform(X_test[cols])

X_test_imp = pd.DataFrame(imputed_array, columns = cols)
In [278]:
X_test_imp.loc[:,'InsuredZipCode'] = X_test_imp['InsuredZipCode'].map(reverse_InsuredZipCode_map)
X_test_imp.loc[:,'InsuredGender'] = X_test_imp['InsuredGender'].map(reverse_InsuredGender_map)
X_test_imp.loc[:,'InsuredEducationLevel'] = X_test_imp['InsuredEducationLevel'].map(reverse_InsuredEducationLevel_map)
X_test_imp.loc[:,'InsuredOccupation'] = X_test_imp['InsuredOccupation'].map(reverse_InsuredOccupation_map)
X_test_imp.loc[:,'InsuredHobbies'] = X_test_imp['InsuredHobbies'].map(reverse_InsuredHobbies_map)
X_test_imp.loc[:,'InsurancePolicyState'] = X_test_imp['InsurancePolicyState'].map(reverse_InsurancePolicyState_map)
X_test_imp.loc[:,'InsuredRelationship'] = X_test_imp['InsuredRelationship'].map(reverse_InsuredRelationship_map)
X_test_imp.loc[:,'TypeOfIncident'] = X_test_imp['TypeOfIncident'].map(reverse_TypeOfIncident_map)
X_test_imp.loc[:,'TypeOfCollission'] = X_test_imp['TypeOfCollission'].map(reverse_TypeOfCollission_map)
X_test_imp.loc[:,'SeverityOfIncident'] = X_test_imp['SeverityOfIncident'].map(reverse_SeverityOfIncident_map)
X_test_imp.loc[:,'AuthoritiesContacted'] = X_test_imp['AuthoritiesContacted'].map(reverse_AuthoritiesContacted_map)
X_test_imp.loc[:,'IncidentState'] = X_test_imp['IncidentState'].map(reverse_IncidentState_map)
X_test_imp.loc[:,'IncidentCity'] = X_test_imp['IncidentCity'].map(reverse_IncidentCity_map)
X_test_imp.loc[:,'IncidentAddress'] = X_test_imp['IncidentAddress'].map(reverse_IncidentAddress_map)
X_test_imp.loc[:,'PropertyDamage'] = X_test_imp['PropertyDamage'].map(reverse_PropertyDamage_map)
X_test_imp.loc[:,'PoliceReport'] = X_test_imp['PoliceReport'].map(reverse_PoliceReport_map)
X_test_imp.loc[:,'VehicleMake'] = X_test_imp['VehicleMake'].map(reverse_VehicleMake_map)
X_test_imp.loc[:,'VehicleModel'] = X_test_imp['VehicleModel'].map(reverse_VehicleModel_map)
X_test_imp.loc[:,'VehicleYOM'] = X_test_imp['VehicleYOM'].map(reverse_VehicleYOM_map)
In [279]:
X_test_imp.isnull().sum()
Out[279]:
InsuredAge               0
InsuredZipCode           0
InsuredGender            0
InsuredEducationLevel    0
InsuredOccupation        0
InsuredHobbies           0
CapitalGains             0
CapitalLoss              0
CustomerLoyaltyPeriod    0
InsurancePolicyState     0
Policy_Deductible        0
PolicyAnnualPremium      0
UmbrellaLimit            0
InsuredRelationship      0
TypeOfIncident           0
TypeOfCollission         0
SeverityOfIncident       0
AuthoritiesContacted     0
IncidentState            0
IncidentCity             0
IncidentAddress          0
IncidentTime             0
NumberOfVehicles         0
PropertyDamage           0
BodilyInjuries           0
Witnesses                0
PoliceReport             0
AmountOfTotalClaim       0
AmountOfInjuryClaim      0
AmountOfPropertyClaim    0
AmountOfVehicleDamage    0
VehicleMake              0
VehicleModel             0
VehicleYOM               0
SplitLimit               0
CombinedSingleLimit      0
dtype: int64
In [280]:
# categorical conversion

# collecting the categorical columns in the cat_cols variable
cat_cols = ['InsuredZipCode', 'InsuredGender', 'InsuredEducationLevel', 'InsuredOccupation',
            'InsuredHobbies', 'InsurancePolicyState', 'InsuredRelationship', 'TypeOfIncident',
            'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState',
            'IncidentCity', 'IncidentAddress', 'PropertyDamage', 'PoliceReport', 
             'VehicleMake', 'VehicleModel', 'VehicleYOM']

# calling the convert_columns_types_to_category() function defined above
X_test_imp = convert_columns_types_to_category(X_test_imp, cols=cat_cols, col_type = 'category')
'### Before conversion: ###'
InsuredAge               float64
InsuredZipCode             int64
InsuredGender             object
InsuredEducationLevel     object
InsuredOccupation         object
InsuredHobbies            object
CapitalGains             float64
CapitalLoss              float64
CustomerLoyaltyPeriod    float64
InsurancePolicyState      object
Policy_Deductible        float64
PolicyAnnualPremium      float64
UmbrellaLimit            float64
InsuredRelationship       object
TypeOfIncident            object
TypeOfCollission          object
SeverityOfIncident        object
AuthoritiesContacted      object
IncidentState             object
IncidentCity              object
IncidentAddress           object
IncidentTime             float64
NumberOfVehicles         float64
PropertyDamage            object
BodilyInjuries           float64
Witnesses                float64
PoliceReport              object
AmountOfTotalClaim       float64
AmountOfInjuryClaim      float64
AmountOfPropertyClaim    float64
AmountOfVehicleDamage    float64
VehicleMake               object
VehicleModel              object
VehicleYOM                object
SplitLimit               float64
CombinedSingleLimit      float64
dtype: object
'### After conversion: ###'
InsuredAge                float64
InsuredZipCode           category
InsuredGender            category
InsuredEducationLevel    category
InsuredOccupation        category
InsuredHobbies           category
CapitalGains              float64
CapitalLoss               float64
CustomerLoyaltyPeriod     float64
InsurancePolicyState     category
Policy_Deductible         float64
PolicyAnnualPremium       float64
UmbrellaLimit             float64
InsuredRelationship      category
TypeOfIncident           category
TypeOfCollission         category
SeverityOfIncident       category
AuthoritiesContacted     category
IncidentState            category
IncidentCity             category
IncidentAddress          category
IncidentTime              float64
NumberOfVehicles          float64
PropertyDamage           category
BodilyInjuries            float64
Witnesses                 float64
PoliceReport             category
AmountOfTotalClaim        float64
AmountOfInjuryClaim       float64
AmountOfPropertyClaim     float64
AmountOfVehicleDamage     float64
VehicleMake              category
VehicleModel             category
VehicleYOM               category
SplitLimit                float64
CombinedSingleLimit       float64
dtype: object
In [281]:
X_train_num, X_train_cat = get_num_cat_dataframes(X_train_imp)
X_test_num, X_test_cat = get_num_cat_dataframes(X_test_imp)
(29472, 17)
(29472, 19)
(8651, 17)
(8651, 19)
In [282]:
def one_hot_encode(df, categorical_columns):
    """
    Performs one-hot encoding on a dataframe's categorical features.
    
    Parameters:
    df (pandas.DataFrame): The input dataframe
    categorical_columns (list): List of column names in the dataframe that are categorical
    
    Returns:
    tuple: (pandas.DataFrame, OneHotEncoder) - a new dataframe with the one-hot
    encoded features, and the fitted encoder for reuse on the test set
    """
    # Create OneHotEncoder object that returns a dense array
    # (scikit-learn >= 1.2 renamed this argument to sparse_output)
    encoder = OneHotEncoder(sparse=False)
    
    # Fit the encoder on the categorical columns
    encoder.fit(df[categorical_columns])
    
    # Get the one-hot encoded features
    one_hot_encoded = encoder.transform(df[categorical_columns])
    
    # Get the column names of the one-hot encoded features
    # (scikit-learn >= 1.0 renamed this method to get_feature_names_out)
    column_names = encoder.get_feature_names(categorical_columns)
    
    # Create a new dataframe with the one-hot encoded features and column names
    df_one_hot_encoded = pd.DataFrame(one_hot_encoded, columns=column_names)
    
    # Return the new dataframe along with the fitted encoder
    return df_one_hot_encoded, encoder
In [283]:
X_train_cat_encoded, ohe_dt = one_hot_encode(X_train_cat,X_train_cat.columns)
In [284]:
X_train_cat_encoded.shape
Out[284]:
(29472, 2155)
In [285]:
X_test_cat_encoded = pd.DataFrame(ohe_dt.transform(X_test_cat))     # transform only (no refit): apply the encoder fitted on the training categoricals
X_test_cat_encoded.columns = ohe_dt.get_feature_names(X_test_cat.columns)
In [286]:
X_test_cat_encoded.shape
Out[286]:
(8651, 2155)
In [287]:
X_train_num.reset_index(inplace = True, drop = True)
X_train_num.shape
Out[287]:
(29472, 17)
In [288]:
X_train_cat_encoded.reset_index(inplace = True, drop = True)
X_train_cat_encoded.shape
Out[288]:
(29472, 2155)
In [289]:
X_test_num.reset_index(inplace = True, drop = True)
X_test_num.shape
Out[289]:
(8651, 17)
In [290]:
X_test_cat_encoded.reset_index(inplace = True, drop = True)
X_test_cat_encoded.shape
Out[290]:
(8651, 2155)
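After resetting both indexes to a common 0..n-1 range, the numeric and encoded categorical frames can be stitched together column-wise. `combine_num_df_cat_df` is defined earlier in the notebook; it presumably amounts to a `pd.concat` along `axis=1`:

```python
import pandas as pd

num = pd.DataFrame({'age': [25.0, 40.0]})
cat = pd.DataFrame({'gender_F': [1.0, 0.0], 'gender_M': [0.0, 1.0]})

# axis=1 aligns rows on the (reset) index and appends columns side by side.
combined = pd.concat([num, cat], axis=1)
print(combined.shape)  # (2, 3)
```

The index reset in the cells above is what makes this alignment safe: concatenating frames with mismatched indexes would introduce NaN rows instead of raising an error.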
In [291]:
X_train = combine_num_df_cat_df(X_train_num, X_train_cat_encoded)
[display output truncated: the combined training frame lists the 17 numeric columns (InsuredAge, CapitalGains, ..., SplitLimit, CombinedSingleLimit) followed by the 2,155 one-hot encoded columns (InsuredZipCode_*, ..., IncidentAddress_*, PropertyDamage_*, PoliceReport_*, VehicleMake_*, VehicleModel_*, ...); shape (29472, 2172)]
VehicleModel_ML350 VehicleModel_Malibu VehicleModel_Maxima VehicleModel_Neon VehicleModel_Passat VehicleModel_Pathfinder VehicleModel_RAM VehicleModel_RSX VehicleModel_Silverado VehicleModel_TL VehicleModel_Tahoe VehicleModel_Ultima VehicleModel_Wrangler VehicleModel_X5 VehicleModel_X6 VehicleYOM_1995 VehicleYOM_1996 VehicleYOM_1997 VehicleYOM_1998 VehicleYOM_1999 VehicleYOM_2000 VehicleYOM_2001 VehicleYOM_2002 VehicleYOM_2003 VehicleYOM_2004 VehicleYOM_2005 VehicleYOM_2006 VehicleYOM_2007 VehicleYOM_2008 VehicleYOM_2009 VehicleYOM_2010 VehicleYOM_2011 VehicleYOM_2012 VehicleYOM_2013 VehicleYOM_2014 VehicleYOM_2015
0 36.00 0.00 0.00 194.00 1000.00 1440.36 3392565.00 17.00 2.00 1.00 0.00 73318.00 9164.00 9164.00 54990.00 100.00 300.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 28.00 0.00 0.00 126.00 512.00 1510.11 0.00 3.00 4.00 0.00 1.00 68029.00 11367.00 11367.00 45295.00 100.00 500.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00
2 46.00 46100.00 0.00 277.00 1764.00 1098.15 0.00 11.00 1.00 1.00 2.00 37502.00 4167.00 4167.00 29168.00 250.00 500.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00
3 38.00 0.00 -60300.00 189.00 1950.00 1214.11 0.00 23.00 3.00 2.00 2.00 68628.00 11510.00 5755.00 51363.00 250.00 300.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 29.00 0.00 -66200.00 124.00 510.00 1308.77 0.00 14.00 3.00 0.00 3.00 71719.00 7172.00 7172.00 57375.00 500.00 1000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

5 rows × 2172 columns

In [292]:
# Combine the numeric features with the one-hot encoded categorical features
# to form the final test-set design matrix.
X_test = combine_num_df_cat_df(X_test_num, X_test_cat_encoded)
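`combine_num_df_cat_df` is this notebook's own helper; its definition is not shown here. A minimal sketch of what such a helper might do, assuming it simply concatenates the numeric frame and the one-hot encoded categorical frame column-wise with `pd.concat` after aligning row positions:

```python
import pandas as pd

def combine_num_df_cat_df(num_df: pd.DataFrame,
                          cat_encoded_df: pd.DataFrame) -> pd.DataFrame:
    """Column-wise concatenation of numeric and encoded categorical features.

    Hypothetical sketch of the notebook's helper: indices are reset so rows
    line up positionally before concatenating along axis=1.
    """
    return pd.concat(
        [num_df.reset_index(drop=True),
         cat_encoded_df.reset_index(drop=True)],
        axis=1,
    )

# Tiny illustration with made-up values for two of the dataset's columns
num = pd.DataFrame({"InsuredAge": [41, 26]})
cat = pd.get_dummies(pd.DataFrame({"PoliceReport": ["NO", "YES"]}))
X = combine_num_df_cat_df(num, cat)
print(X.shape)  # (2, 3): InsuredAge + PoliceReport_NO + PoliceReport_YES
```

Resetting both indices avoids the silent row misalignment (and resulting NaNs) that `pd.concat` produces when the two frames carry different index labels.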
[Output truncated: preview of X_test with the same one-hot encoded feature columns as the training matrix (InsuredAge, CapitalGains, CapitalLoss, …, InsuredZipCode_*, IncidentAddress_Location *, PropertyDamage_*, PoliceReport_*, VehicleMake_*, VehicleModel_*, VehicleYOM_*) and its first few rows; omitted for brevity.]
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 40.00 50000.00 -56900.00 191.00 929.00 1002.83 0.00 11.00 1.00 0.00 2.00 73103.00 7964.00 14060.00 51079.00 500.00 1000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 37.00 52300.00 0.00 153.00 590.00 994.37 0.00 4.00 3.00 0.00 3.00 43552.00 4934.00 4934.00 33684.00 250.00 500.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

5 rows × 2172 columns

In [293]:
from sklearn.preprocessing import LabelEncoder
label_encoder_pattern = LabelEncoder()

# Fit on the training labels only, then apply the same mapping to both splits
label_encoder_pattern.fit(y_train)
y_train = label_encoder_pattern.transform(y_train)
y_test = label_encoder_pattern.transform(y_test)
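As a quick sanity check of what the encoder does, a minimal sketch on hypothetical "N"/"Y" fraud flags (toy labels, not the notebook's actual target column):

```python
from sklearn.preprocessing import LabelEncoder

# Classes are sorted alphabetically, so "N" -> 0 and "Y" -> 1
le = LabelEncoder()
le.fit(["N", "Y", "N", "Y"])

print(le.transform(["N", "Y"]))    # integer codes for each label
print(le.inverse_transform([1]))   # map a code back to its label
```

The inverse mapping is useful later when translating predicted class indices back into human-readable labels for the extracted rules.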
In [294]:
dt_pattern_gscv = DecisionTreeClassifier(criterion='gini', max_depth=6)
In [523]:
from sklearn.tree import DecisionTreeClassifier
dt_pattern = DecisionTreeClassifier() 
# Evaluate an unpruned decision tree for pattern extraction
result = fit_and_evaluate_classification("Decision Tree for Pattern Extraction", dt_pattern, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)

display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
8 Support Vector Classifier 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
9 Support Vector Classifier Grid Search 0.97 0.98 0.95 0.97 0.97 0.94 0.94 0.84 0.89 0.91
10 XGBoost Classifier 0.97 0.99 0.96 0.97 0.97 0.94 0.93 0.85 0.89 0.91
11 XGBoost Classifier Grid Search 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
12 Hard Voting 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
13 Soft Voting 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
14 Decision Tree for Pattern Extraction 1.00 1.00 1.00 1.00 1.00 0.85 0.71 0.76 0.74 0.82
15 Decision Tree for Pattern Extraction with gscv 0.86 0.86 0.86 0.86 0.86 0.83 0.66 0.75 0.70 0.80
16 Decision Tree for Pattern Extraction 1.00 1.00 1.00 1.00 1.00 0.86 0.73 0.76 0.75 0.83
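The `fit_and_evaluate_classification` helper is defined earlier in the notebook; for reference, a plausible sketch of its shape (an assumption, not the original definition) that is consistent with how `result[2]`..`result[11]` are indexed when building `new_row`:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def fit_and_evaluate_classification(name, model, X_train, X_test, y_train, y_test):
    """Fit a binary classifier and return (name, model, 5 train metrics, 5 test metrics)."""
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    def scores(y_true, y_pred):
        # Accuracy, precision, recall, F1 and ROC AUC on hard predictions
        return (accuracy_score(y_true, y_pred),
                precision_score(y_true, y_pred),
                recall_score(y_true, y_pred),
                f1_score(y_true, y_pred),
                roc_auc_score(y_true, y_pred))

    return (name, model) + scores(y_train, pred_train) + scores(y_test, pred_test)
```

Returning the fitted model as `result[1]` explains why the metric columns start at index 2 in the DataFrame construction above.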
In [524]:
from sklearn.tree import DecisionTreeClassifier
dt_pattern_gscv = DecisionTreeClassifier(criterion='gini', max_depth=6) 
# Evaluate the depth-limited decision tree (grid-search parameters) for pattern extraction
result = fit_and_evaluate_classification("Decision Tree for Pattern Extraction with gscv", dt_pattern_gscv, X_train, X_test, y_train, y_test)
new_row = pd.DataFrame({'Model Name': [result[0]], 
                        'Accuracy (Train)': [result[2]], 
                        'Precision (Train)': [result[3]], 
                        'Recall (Train)': [result[4]], 
                        'F1 Score (Train)': [result[5]], 
                        'ROC AUC Score (Train)': [result[6]], 
                        'Accuracy (Test)': [result[7]], 
                        'Precision (Test)': [result[8]], 
                        'Recall (Test)': [result[9]], 
                        'F1 Score (Test)': [result[10]], 
                        'ROC AUC Score (Test)': [result[11]]})

results_df = pd.concat([results_df, new_row], ignore_index=True)

display(results_df)
Model Name Accuracy (Train) Precision (Train) Recall (Train) F1 Score (Train) ROC AUC Score (Train) Accuracy (Test) Precision (Test) Recall (Test) F1 Score (Test) ROC AUC Score (Test)
0 LR 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
1 Logistic Regression Grid Search 0.95 0.98 0.91 0.94 0.95 0.94 0.94 0.85 0.89 0.91
2 Decision Tree 1.00 1.00 1.00 1.00 1.00 0.87 0.74 0.81 0.77 0.85
3 Decision Tree Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.93 0.85 0.89 0.91
4 Random Forest Classifier 1.00 1.00 1.00 1.00 1.00 0.94 0.94 0.85 0.89 0.91
5 Random Forest Classifier Grid Search 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
6 K-Nearest Neighbour Classifier 0.96 0.97 0.94 0.96 0.96 0.93 0.90 0.85 0.87 0.91
7 K-Nearest Neighbour Classifier Grid Search 1.00 1.00 1.00 1.00 1.00 0.93 0.89 0.85 0.87 0.91
8 Support Vector Classifier 0.95 0.98 0.92 0.95 0.95 0.94 0.94 0.85 0.89 0.91
9 Support Vector Classifier Grid Search 0.97 0.98 0.95 0.97 0.97 0.94 0.94 0.84 0.89 0.91
10 XGBoost Classifier 0.97 0.99 0.96 0.97 0.97 0.94 0.93 0.85 0.89 0.91
11 XGBoost Classifier Grid Search 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
12 Hard Voting 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
13 Soft Voting 0.96 0.98 0.94 0.96 0.96 0.94 0.94 0.85 0.89 0.91
14 Decision Tree for Pattern Extraction 1.00 1.00 1.00 1.00 1.00 0.85 0.71 0.76 0.74 0.82
15 Decision Tree for Pattern Extraction with gscv 0.86 0.86 0.86 0.86 0.86 0.83 0.66 0.75 0.70 0.80
16 Decision Tree for Pattern Extraction 1.00 1.00 1.00 1.00 1.00 0.86 0.73 0.76 0.75 0.83
17 Decision Tree for Pattern Extraction with gscv 0.84 0.87 0.81 0.84 0.84 0.84 0.69 0.74 0.72 0.81
In [525]:
# Extract the top 20 important features
importances = dt_pattern_gscv.feature_importances_
indices = np.argsort(importances)[::-1]
top_features = indices[:20]

# Extract the decision rules (root-to-leaf paths) from the fitted tree
def extract_rules(tree, feature_names, class_names):
    rules = []
    def recurse(node, conditions):
        if tree.feature[node] != -2:  # -2 marks a leaf node in sklearn's tree structure
            feature = feature_names[tree.feature[node]]
            threshold = tree.threshold[node]
            recurse(tree.children_left[node], conditions + [f"{feature} <= {threshold:.2f}"])
            recurse(tree.children_right[node], conditions + [f"{feature} > {threshold:.2f}"])
        else:
            predicted_class = class_names[np.argmax(tree.value[node])]
            # Record the rule, its predicted class and the leaf's sample count
            rules.append((" AND ".join(conditions), predicted_class, tree.n_node_samples[node]))
    recurse(0, [])
    return rules

rules = extract_rules(dt_pattern_gscv.tree_, X_train.columns, [0, 1])

# Keep only the rules that predict the fraudulent class (1) and rank them
# by the number of training samples covered by the leaf
fraud_rules = [rule for rule in rules if rule[1] == 1]
ranked_rules = sorted(fraud_rules, key=lambda x: x[2], reverse=True)

# Select the top 20 rules
top_20_rules = [rule[0] for rule in ranked_rules[:20]]
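As a cross-check on the hand-rolled tree traversal above, scikit-learn's built-in `export_text` renders the same root-to-leaf rules as an indented text tree; a minimal sketch on the toy iris dataset (illustrative only, not the claims data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small, readable tree on iris
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Render the split conditions and leaf classes as text
rules_text = export_text(clf, feature_names=list(iris.feature_names))
print(rules_text)
```

`export_text` is handy for eyeballing a tree, but the custom traversal is still needed here because it lets us attach per-leaf metadata (predicted class, sample counts) for ranking the fraud patterns.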
In [534]:
# Define the file path and name
file_path = "./top_rules.txt"

# Open the file in write mode
with open(file_path, "w") as file:

    # Write each rule to the file with a heading
    for i, rule in enumerate(top_20_rules):
        file.write("Rule {}: {}\n".format(i+1, rule))

# Confirm that the file was saved
print("The top 20 rules were saved to:", file_path)
The top 20 rules were saved to: ./top_rules.txt